• Inverted Page Tables

    From Robert Finch@robfi680@gmail.com to comp.arch on Thu Jan 29 10:33:29 2026
    From Newsgroup: comp.arch

    Have inverted page tables fallen out of favor? And how much OS support
    is there for them?

    I am thinking about using an inverted page table instead of a
    hierarchical page table for the current project. The advantage is that
    the entire page table should be able to fit into BRAM on the FPGA. It
    should then be possible to eliminate the TLB from the design at perhaps
    a small performance cost. It would reduce the LUT cost of the MMU.

    Translations could work directly from the page table.

    The hash function would select a page table group. Averaging about 1.2
    lookups per memory access. Using the ASID XOR'd with address bits 20 to
    29 as the hash. 32k entries in the table.

    Using 64kB pages to keep the page table small. 1GB RAM to manage.
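    The index computation described above might be sketched as follows (the
    ASID width and the exact fold into a 15-bit index are assumptions, not
    part of the original description):

```c
#include <stdint.h>

#define IPT_ENTRIES 32768u              /* 32k entries in the table  */
#define IPT_MASK    (IPT_ENTRIES - 1u)  /* 15-bit index              */

/* Hash = ASID XOR'd with VA bits 20..29, masked to the table size.
   How the 10 VA bits and the ASID fold into 15 index bits is assumed. */
static inline uint32_t ipt_hash(uint32_t asid, uint64_t va)
{
    uint32_t va_bits = (uint32_t)((va >> 20) & 0x3FFu);  /* bits 20..29 */
    return (asid ^ va_bits) & IPT_MASK;
}
```

    With 64kB pages and 1GB to manage there are only 16k physical pages, so
    32k entries gives the 2x headroom mentioned later in the thread.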
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Jan 29 18:04:44 2026
    From Newsgroup: comp.arch

    On Thu, 29 Jan 2026 10:33:29 -0500
    Robert Finch <robfi680@gmail.com> wrote:

    Have inverted page tables fallen out of favor?

    It depends on whom you ask.
    If you ask LBT then inverted page tables never had an opportunity to
    fall out of his favor, because he has hated them since the very first
    encounter.


  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Thu Jan 29 13:09:37 2026
    From Newsgroup: comp.arch

    I am thinking about using an inverted page table instead of a hierarchical page table for the current project. The advantage is that the entire page table should be able to fit into BRAM on the FPGA. It should then be possible to eliminate the TLB from the design at perhaps a small performance cost. It would reduce the LUT cost of the MMU.

    If you have the freedom (on the OS side) to use a virtually-indexed
    cache, I'd recommend you try and use a system like:

    PA physical_address_of_virtual_address (VA x)
    {
        if (x == 0)
            return 0;
        else {
            VA pagetable_entry = (x / PAGESIZE * sizeof (VA));
            return *(PA *)pagetable_entry;
        }
    }

    This uses the CPU cache as the TLB and does pagetable walks "from the
    bottom", so its efficiency is basically not affected by the number of
    levels of pagetables.


    === Stefan
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Jan 29 21:40:27 2026
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    Have inverted page tables fallen out of favor?

    Not so much fallen out of favor as: "they were never really IN Favor".

    And how much OS support
    is there for them?

    Probably just enough to get by if you copy an existing IPT.

    I am thinking about using an inverted page table instead of a
    hierarchical page table for the current project. The advantage is that
    the entire page table should be able to fit into BRAM on the FPGA. It
    should then be possible to eliminate the TLB from the design at perhaps
    a small performance cost. It would reduce the LUT cost of the MMU.

    An alternative is for HW to generate a hash address based on faulting
    VA and make one probe into a 1-16MB TLB stored across the memory hierarchy.
    If the access matches VA the rest of the data is your PTE. If not, then
    you trap to SW to refill the memory-TLB. Rationale: if the hashed address
    is not that bad, and the hash table is large, TLB refills end up at low
    cost (delay).
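    That single-probe scheme might be modeled like this (names, table size,
    and the hash itself are placeholders, not Mitch's design):

```c
#include <stdbool.h>
#include <stdint.h>

#define MTLB_ENTRIES (1u << 17)   /* illustrative; a few MB of slots */

typedef struct {
    bool     valid;
    uint64_t vpn;   /* VA tag that owns this slot */
    uint64_t pte;   /* translation + permissions  */
} mtlb_slot;

static mtlb_slot mtlb[MTLB_ENTRIES];

static inline uint32_t mtlb_hash(uint64_t vpn)
{
    return (uint32_t)(vpn ^ (vpn >> 17)) & (MTLB_ENTRIES - 1u);
}

/* One hashed probe into the memory-resident TLB. */
static bool mtlb_probe(uint64_t vpn, uint64_t *pte_out)
{
    uint32_t h = mtlb_hash(vpn);
    if (mtlb[h].valid && mtlb[h].vpn == vpn) {  /* VA matches: use PTE */
        *pte_out = mtlb[h].pte;
        return true;
    }
    return false;   /* mismatch: trap to SW to refill this slot */
}

/* What the SW refill handler would do after walking its own tables. */
static void mtlb_refill(uint64_t vpn, uint64_t pte)
{
    mtlb[mtlb_hash(vpn)] = (mtlb_slot){ .valid = true, .vpn = vpn, .pte = pte };
}
```

    The point of the scheme is that the common case is the single probe;
    the SW trap only runs on a tag mismatch.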

    Translations could work directly from the page table.

    The hash function would select a page table group. Averaging about 1.2 lookups per memory access. Using the ASID XOR'd with address bits 20 to
    29 as the hash. 32k entries in the table.

    Using 64kB pages to keep the page table small. 1GB RAM to manage.

    But realistically, a HW table-walker with a bit of internal caching supplemented with a large L2-TLB works fine in practice.
  • From BGB@cr88192@gmail.com to comp.arch on Thu Jan 29 17:36:24 2026
    From Newsgroup: comp.arch

    On 1/29/2026 9:33 AM, Robert Finch wrote:
    Have inverted page tables fallen out of favor? And how much OS support
    is there for them?

    I am thinking about using an inverted page table instead of a
    hierarchical page table for the current project. The advantage is that
    the entire page table should be able to fit into BRAM on the FPGA. It
    should then be possible to eliminate the TLB from the design at perhaps
    a small performance cost. It would reduce the LUT cost of the MMU.

    Translations could work directly from the page table.

    The hash function would select a page table group. Averaging about 1.2 lookups per memory access. Using the ASID XOR'd with address bits 20 to
    29 as the hash. 32k entries in the table.

    Using 64kB pages to keep the page table small. 1GB RAM to manage.


    Inverted page tables kinda have the disadvantage, though, of combining
    *both* the disadvantages of a software-managed TLB with some of the disadvantages of a HW page walker.

    The main argued benefits they have (over plain SW TLB):
    Typically larger (because RAM backed);
    Potentially slightly faster on process switches.
    Less of a "TLB miss storm".

    Common drawbacks with SW TLB:
    Still need to deal with TLB misses (if page isn't in IPT);
    Finite associativity means can't fit whole VAS in an IPT.

    Common drawbacks with HW page walking:
    Still need to access RAM, potentially multiple times, to resolve a miss;
    Loss of flexibility: As with HW page walking, one can't transparently
    deal with multi-level address translation (without, essentially, falling
    back to just treating it like a SW TLB; or having whatever code manages
    the bottom-level IPT also being aware of the use of nested address translation).

    Though, maybe not as bad as with HW page walking, where the page-walker
    also has to be aware of and natively deal with nested translation
    (potentially a very big headache for a HW page walker).



    Big set-associative IPT held in BRAM (vs in Main RAM):
    Basically the same as how I am already doing SW TLB.

    While it could be argued that an IPT can reduce process-switch cost by
    having per-process IPT, with some cleverness this can also be faked on
    top of SW TLB (by modeling the TLB per-process, and then proactively
    injecting TLBE's on process switch).




    One other intermediate idea I had considered was a "Last Level Page
    Table", where the MMU is given the address of the last-level of each page-table as its own specialized TLBE (after any requisite address translation), which can potentially have some of the advantages of both
    SW TLB and HW page walking:
    Retains flexibility and can deal with nested translation;
    Avoids many of the harder parts of a page walker;
    Only needs a single memory access to resolve a request.
    Could have some of the benefits of a HW page-walker:
    Most of the would-be TLB misses can be handled by hardware.
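    The "Last Level Page Table" idea above might be sketched like so (all
    names and sizes are hypothetical; the real design would hold the table
    bases as specialized TLBEs inside the MMU rather than in a SW array):

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT  14    /* 16kB pages, illustrative              */
#define LVL_BITS    10    /* entries per last-level page table     */
#define LLPT_SLOTS  64    /* small cache of last-level table bases */

typedef struct {
    bool      valid;
    uint64_t  region;   /* VA bits above the last-level index           */
    uint64_t *table;    /* already-translated base of the final level   */
} llpt_slot;

static llpt_slot llpt[LLPT_SLOTS];

static void llpt_install(uint64_t region, uint64_t *table)
{
    llpt[region % LLPT_SLOTS] = (llpt_slot){ true, region, table };
}

/* On a hit, resolving a request takes a single memory access:
   one load from the cached last-level table. */
static bool llpt_translate(uint64_t va, uint64_t *pte_out)
{
    uint64_t idx    = (va >> PAGE_SHIFT) & ((1u << LVL_BITS) - 1u);
    uint64_t region = va >> (PAGE_SHIFT + LVL_BITS);
    llpt_slot *s = &llpt[region % LLPT_SLOTS];
    if (s->valid && s->region == region) {
        *pte_out = s->table[idx];   /* the single memory access */
        return true;
    }
    return false;   /* miss: SW installs the (translated) table base */
}
```

    Because SW supplies the already-translated table base, nested
    translation stays a SW problem, which is where the flexibility
    claim above comes from.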

    Though, admittedly, haven't implemented it as of yet, but mostly because
    in my case TLB miss activity tends to be low enough that it isn't a big
    enough part of the total clock cycle budget (or, IOW, 256x4W already has
    a pretty good hit rate in a world of mostly smallish processes).

    Would have the issue that it can no longer detect success or failure
    within a fixed latency, so would need more complexity to deal with an "in-flight potential miss", either needing a mechanism to signal the
    request to bounce around the ring some more, or a way to temporarily
    take requests off the bus and inject them back later.


    Though, the most likely option being that untranslated VAS's are not
    allowed to leave the L1 ringbus (so would be deflected back into the
    ring until the TLB can deal with them). Does lead to the potential issue
    of bus congestion leading to a scenario where the page-access response
    (from the L2 cache or RAM) can't easily enter back into the L1 ring if
    most of the spots are already full (with in-flight cycling requests).

    Though, then again, for latency:
    L1 D$: 2c
    L1 I$: 2c
    TLB: 3c
    BridgeVcA: 3c
    This is a 64x4W write-through cache that also functions as a bridge.

    So, L1A has ~ 10c ring latency.
    Assuming both L1's fire off 2 requests, this is around a 40% congestion,
    and theoretically should not break down until around 70% (unless the
    latency values have an unlucky fraction, *).

    Goes and checks:
    Needed logic was already in place, though as-is it would print debug
    messages whenever the request is bounced back inside the L1 ring (an untranslated L1 request escaping the L1 ring would be "extra bad", as
    only physically addressed requests are allowed on the L2 ring).

    *: If the L2 ring has a latency that is an integer multiple of the L1
    ring latency, then potentially the same messages can collide endlessly,
    and the bus would effectively be stalled. In theory, could be addressed
    via a mechanism to either fine-adjust or jitter L2 ring latency if needed.


  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch on Fri Jan 30 11:16:43 2026
    From Newsgroup: comp.arch

    In article <10lfuka$1d2c4$1@dont-email.me>,
    Robert Finch <robfi680@gmail.com> wrote:
    Have inverted page tables fallen out of favor? And how much OS support
    is there for them?

    IVTs are a terrible idea. Consider the challenge of using them
    to share memory between virtual address spaces in a
    multiprocessing system, efficiently.

    - Dan C.

  • From Robert Finch@robfi680@gmail.com to comp.arch on Fri Jan 30 16:03:39 2026
    From Newsgroup: comp.arch

    On 2026-01-30 3:35 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-01-30 6:16 a.m., Dan Cross wrote:
    In article <10lfuka$1d2c4$1@dont-email.me>,
    Robert Finch <robfi680@gmail.com> wrote:
    Have inverted page tables fallen out of favor? And how much OS support
    is there for them?

    IVTs are a terrible idea. Consider the challenge of using them
    to share memory between virtual address spaces in a
    multiprocessing system, efficiently.

    - Dan C.

    I may have used misleading terminology; the IVT I refer to is a
    hash-table-based one. I tend to think of them as the same thing. I do
    not think anybody would use a plain IVT.

    <response to earlier comments>
    Is the entire VAS covered by the hierarchical page table system?

    With the entire PAS covered by the page table in BRAM it can be walked
    in hardware very quickly, one cycle per step as opposed to walking the
    page table in DRAM which could be quite slow.

    Process switch is handled by including an ASID in the mapping as for a TLB.
    For the IVT implementation the table is twice the size needed to cover
    the PAS to allow for shared memory pages.


    The table just stores mappings VPN -> PPN so I am not quite following
    the challenge of using them for shared memory? Multiple VPNs -> the
    same PPN are possible. Is it the freeing up of physical pages in SW that
    causes an issue?

    Consider mmap(file) in multiple different processes at different VAs.
    So, now one has multiple VA pages pointing at one Physical page.

    ??? I think a hash table has this characteristic. Multiple VA pages can
    point to a single physical page using a hash table. Is it just a
    performance issue? ASID is part of the hash/compare.

    I guess I should take a look at the mmap() code.

    The hash table is only slightly different than a giant TLB. The TLB is
    walked on a miss instead of walking main memory. Page faults would be
    handled the same way.

    The table is clustered: the hash points to eight entries which are then searched in parallel for the correct one. If not found, a table walk
    begins using quadratic open addressing.
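    A SW model of that clustered lookup might read as follows (the hash
    function and entry layout are placeholders; the probe limit matches the
    30-step / 240-slot figure given later in the thread):

```c
#include <stdbool.h>
#include <stdint.h>

#define CLUSTER  8       /* entries compared in parallel in HW          */
#define NGROUPS  4096u   /* 4096 * 8 = 32k entries                      */
#define MAXSTEPS 30      /* give up (page fault) after 30 probe steps   */

typedef struct {
    bool     valid;
    uint16_t asid;
    uint64_t vpn;
    uint64_t ppn;
} ipt_entry;

static ipt_entry ipt[NGROUPS][CLUSTER];

/* The hash selects a group; the 8 entries of the cluster are searched
   (in HW, in parallel).  On a group miss, quadratic open addressing
   picks the next group to probe. */
static bool ipt_lookup(uint16_t asid, uint64_t vpn, uint64_t *ppn_out)
{
    uint32_t g0 = (uint32_t)(vpn ^ asid) % NGROUPS;
    for (uint32_t step = 0; step < MAXSTEPS; step++) {
        ipt_entry *grp = ipt[(g0 + step * step) % NGROUPS];
        for (int i = 0; i < CLUSTER; i++)
            if (grp[i].valid && grp[i].asid == asid && grp[i].vpn == vpn) {
                *ppn_out = grp[i].ppn;
                return true;
            }
    }
    return false;   /* 240 slots probed without a match: page fault */
}

/* Insertion follows the same probe sequence, taking the first free slot. */
static bool ipt_insert(uint16_t asid, uint64_t vpn, uint64_t ppn)
{
    uint32_t g0 = (uint32_t)(vpn ^ asid) % NGROUPS;
    for (uint32_t step = 0; step < MAXSTEPS; step++) {
        ipt_entry *grp = ipt[(g0 + step * step) % NGROUPS];
        for (int i = 0; i < CLUSTER; i++)
            if (!grp[i].valid) {
                grp[i] = (ipt_entry){ true, asid, vpn, ppn };
                return true;
            }
    }
    return false;
}
```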



    I seem to recall at least one fellow advocating the limited use of
    shared memory, using replication instead of shared libraries for instance.
    A hierarchical page table is a better solution, but I was looking for
    something lower cost. My monster 2-level TLB is currently about 9000
    LUTs (I have been working to reduce this). The IVT is about 900 LUTs.


  • From BGB@cr88192@gmail.com to comp.arch on Fri Jan 30 15:44:45 2026
    From Newsgroup: comp.arch

    On 1/30/2026 3:03 PM, Robert Finch wrote:
    On 2026-01-30 3:35 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-01-30 6:16 a.m., Dan Cross wrote:
    In article <10lfuka$1d2c4$1@dont-email.me>,
    Robert Finch <robfi680@gmail.com> wrote:
    Have inverted page tables fallen out of favor? And how much OS support
    is there for them?

    IVTs are a terrible idea. Consider the challenge of using them
    to share memory between virtual address spaces in a
    multiprocessing system, efficiently.

    - Dan C.

    I may have used misleading terminology, the IVT I refer to is a hash
    table based one. I tend to think of them as the same thing. I do not
    think anybody would use a plain IVT.

    <response to earlier comments>
    Is the entire VAS covered by the hierarchical page table system?

    With the entire PAS covered by the page table in BRAM it can be walked
    in hardware very quickly, one cycle per step as opposed to walking the
    page table in DRAM which could be quite slow.

    Process switch is handled by including an ASID in the mapping as for
    a TLB.

    For the IVT implementation the table is twice the size needed to cover
    the PAS to allow for shared memory pages.


    The table just stores mappings VPN -> PPN so I am not quite following
    the challenge to using them for shared memory? Multiple VPN-> to the
    same PPN are possible. Is it the freeing up of physical pages in SW that
    cause an issue?

    Consider mmap(file) in multiple different processes at different VAs.
    So, now one has multiple VA pages pointing at one Physical page.

    ??? I think a hash table has this characteristic. Multiple VA pages can point to a single physical page using a hash table. Is it just a performance issue? ASID is part of the hash/compare.

    I guess I should take a look at the mmap() code.

    The hash table is only slightly different than a giant TLB. The TLB is walked on a miss instead of walking main memory. Page faults would be handled the same way.

    The table is clustered, the hash points to eight entries which are then searched in parallel for the correct one. If not found a table walk
    begins using quadratic open addressing.


    It isn't obvious (at least to me) how what you are describing is much different from a SW managed TLB.


    Main difference I think was that I ended up mostly settling on modulo addressing rather than hashing, but the reasoning here is more subtle:
    while on average hashing is better than modulo addressing, the worst
    case behavior for hashing is worse than the worst case for modulo.

    Effectively, hashing requires twice the associativity to not shoot
    itself in the foot in the worst case, whereas modulo is merely slightly
    worse in the average case.

    Ironically, also means that while small-scale alignment is often
    preferable, medium and large scale alignment is best minimized (or, IOW,
    ASLR can actually be beneficial for performance by reducing unrelated
    data stomping the same locations in the caches and TLB).


    Though, there are intermediate options, say, for a TLB, XOR'ing the
    index with the ASID for non-global pages, *1; but then this means
    needing to partition the TLB between global and local (halving
    associativity relative to LUT budget), in turn still favoring the use of modulo addressing.


    Granted, a hashing scheme and multiple probing steps could be possible
    to increase effective associativity at the cost of latency (say, in an
    extreme case, using latency tricks to cause a 2-way TLB to pretend to be
    4 or 8 way).

    Though, with something like my ringbus design this would require needing
    to get clever to allow it to work (as it could not translate requests on multiple adjacent slots). Would likely need, say, a small 1-way TLB on
    the front-end, followed by 2 or 3 additional slots to resolve the lookup sequentially if the 1-way TLB misses (where multiple subsequent requests
    that miss on the 1-way TLB and were shadowed by a prior request taking
    up the "multi-way probe" resources, would need to cycle back around the
    ring for another pass through the TLB, *2).


    *1: Or, semi-global in a MMU that lacks true global pages, as the
    existence of true global pages are actively detrimental in some
    scenarios (and it is preferable instead to divide ASIDs into groups, and
    a global page is only global within a group or within related groups;
    allowing for ASIDs to exist which will not see these "global" pages).

    *2: And, while the ringbus continues to seem high-latency and stupid,
    still seems to be one of the better options to balance performance and cheapness (simpler bus designs being painfully slow, and more advanced FIFO/queue based buses being far more complex and expensive...).




    I seem to recall at least one fellow advocating the limited use of
    shared memory, using replication instead of shared libraries for
    instance.

    A hierarchical page table is a better solution, but I was looking for
    something lower cost. My monster 2-level TLB is currently about 9000
    LUTs (I have been working to reduce this). The IVT is about 900 LUTs.



  • From Robert Finch@robfi680@gmail.com to comp.arch on Fri Jan 30 19:23:56 2026
    From Newsgroup: comp.arch

    On 2026-01-30 4:44 p.m., BGB wrote:
    On 1/30/2026 3:03 PM, Robert Finch wrote:
    On 2026-01-30 3:35 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-01-30 6:16 a.m., Dan Cross wrote:
    In article <10lfuka$1d2c4$1@dont-email.me>,
    Robert Finch <robfi680@gmail.com> wrote:
    Have inverted page tables fallen out of favor? And how much OS
    support is there for them?

    IVTs are a terrible idea. Consider the challenge of using them
    to share memory between virtual address spaces in a
    multiprocessing system, efficiently.

    - Dan C.

    I may have used misleading terminology, the IVT I refer to is a hash
    table based one. I tend to think of them as the same thing. I do not
    think anybody would use a plain IVT.

    <response to earlier comments>
    Is the entire VAS covered by the hierarchical page table system?

    With the entire PAS covered by the page table in BRAM it can be walked
    in hardware very quickly, one cycle per step as opposed to walking the
    page table in DRAM which could be quite slow.

    Process switch is handled by including an ASID in the mapping as for
    a TLB.

    For the IVT implementation the table is twice the size needed to cover
    the PAS to allow for shared memory pages.


    The table just stores mappings VPN -> PPN so I am not quite following
    the challenge to using them for shared memory? Multiple VPN-> to the
    same PPN are possible. Is it the freeing up of physical pages in SW
    that
    cause an issue?

    Consider mmap(file) in multiple different processes at different VAs.
    So, now one has multiple VA pages pointing at one Physical page.

    ??? I think a hash table has this characteristic. Multiple VA pages
    can point to a single physical page using a hash table. Is it just a
    performance issue? ASID is part of the hash/compare.

    I guess I should take a look at the mmap() code.

    The hash table is only slightly different than a giant TLB. The TLB is
    walked on a miss instead of walking main memory. Page faults would be
    handled the same way.

    The table is clustered, the hash points to eight entries which are
    then searched in parallel for the correct one. If not found a table
    walk begins using quadratic open addressing.


    It isn't obvious (at least to me) how what you are describing is much different from a SW managed TLB.

    It is not a lot different. On a translation miss the table is walked
    with HW though instead of SW walking main memory. There are enough
    entries in the table to cover the physical address range. Only page
    faults are handled in SW which should be rare. A page fault is generated
    after 30 steps in the walk though. That is a miss on 240 possible
    locations. 240-way associativity is not enough?
    The page fault handler will likely just report 'out-of-memory' then quit
    the app.


    Main difference I think was that I ended up mostly settling on modulo addressing rather than hashing, but the reasoning here is more subtle:
    while on average hashing is better than modulo addressing, the worst
    case behavior for hashing is worse than the worst case for modulo.

    Effectively, hashing requires twice the associativity to not shoot
    itself in the foot in the worst case, whereas modulo is merely slightly worse in the average case.

    I think clustering gives the table more associativity. Up to eight VAs
    can all hash to the same cluster. Almost as if it were eight-way
    associative.

    I think clustering is key to using an inverted hash-table. I got this
    from studying the PowerPC.


    Ironically, also means that while small-scale alignment is often
    preferable, medium and large scale alignment is best minimized (or, IOW, ASLR can actually be beneficial for performance by reducing unrelated
    data stomping the same locations in the caches and TLB).


    Though, there are intermediate options, say, for a TLB, XOR'ing the
    index with the ASID for non-global pages, *1; but then this means
    needing to partition the TLB between global and local (halving
    associativity relative to LUT budget), in turn still favoring the use of modulo addressing.


    Granted, a hashing scheme and multiple probing steps could be possible
    to increase effective associativity at the cost of latency (say, in an extreme case, using latency tricks to cause a 2-way TLB to pretend to be
    4 or 8 way).

    Not too worried about latency as long as it is reasonable. Every step is another eight ways associative due to clustering.


    Though, with something like my ringbus design this would require needing
    to get clever to allow it to work (as it could not translate requests on multiple adjacent slots). Would likely need, say, a small 1-way TLB on
    the front-end, followed by 2 or 3 additional slots to resolve the lookup sequentially if the 1-way TLB misses (where multiple subsequent requests that miss on the 1-way TLB and were shadowed by a prior request taking
    up the "multi-way probe" resources, would need to cycle back around the
    ring for another pass through the TLB, *2).


    *1: Or, semi-global in a MMU that lacks true global pages, as the
    existence of true global pages are actively detrimental in some
    scenarios (and it is preferable instead to divide ASIDs into groups, and
    a global page is only global within a group or within related groups; allowing for ASIDs to exist which will not see these "global" pages).

    *2: And, while the ringbus continues to seem high-latency and stupid,
    still seems to be one of the better options to balance performance and cheapness (simpler bus designs being painfully slow, and more advanced FIFO/queue based buses being far more complex and expensive...).





    I seem to recall at least one fellow advocating the limited use of
    shared memory, using replication instead of shared libraries for
    instance.

    A hierarchical page table is a better solution, but I was looking for
    something lower cost. My monster 2-level TLB is currently about 9000
    LUTs (I have been working to reduce this). The IVT is about 900 LUTs.




    Using a huge page size (256kB) now so the table can be kept small. There
    are three hash tables, one for instructions, two for data. Same as
    triple porting the table. There are still thousands of pages available.





  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Jan 31 01:36:59 2026
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2026-01-30 4:44 p.m., BGB wrote:
    On 1/30/2026 3:03 PM, Robert Finch wrote:
    On 2026-01-30 3:35 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-01-30 6:16 a.m., Dan Cross wrote:
    In article <10lfuka$1d2c4$1@dont-email.me>,
    Robert Finch <robfi680@gmail.com> wrote:
    Have inverted page tables fallen out of favor? And how much OS
    support is there for them?

    IVTs are a terrible idea. Consider the challenge of using them
    to share memory between virtual address spaces in a
    multiprocessing system, efficiently.

    - Dan C.

    I may have used misleading terminology, the IVT I refer to is a hash
    table based one. I tend to think of them as the same thing. I do not
    think anybody would use a plain IVT.

    <response to earlier comments>
    Is the entire VAS covered by the hierarchical page table system?

    With the entire PAS covered by the page table in BRAM it can be walked
    in hardware very quickly, one cycle per step as opposed to walking the
    page table in DRAM which could be quite slow.

    Process switch is handled by including an ASID in the mapping as for
    a TLB.

    For the IVT implementation the table is twice the size needed to cover
    the PAS to allow for shared memory pages.


    The table just stores mappings VPN -> PPN so I am not quite following
    the challenge to using them for shared memory? Multiple VPN-> to the
    same PPN are possible. Is it the freeing up of physical pages in SW
    that
    cause an issue?

    Consider mmap(file) in multiple different processes at different VAs.
    So, now one has multiple VA pages pointing at one Physical page.

    ??? I think a hash table has this characteristic. Multiple VA pages
    can point to a single physical page using a hash table. Is it just a
    performance issue? ASID is part of the hash/compare.

    I guess I should take a look at the mmap() code.

    The hash table is only slightly different than a giant TLB. The TLB is
    walked on a miss instead of walking main memory. Page faults would be
    handled the same way.

    The table is clustered, the hash points to eight entries which are
    then searched in parallel for the correct one. If not found a table
    walk begins using quadratic open addressing.


    It isn't obvious (at least to me) how what you are describing is much different from a SW managed TLB.

    It is not a lot different. On a translation miss the table is walked
    with HW though instead of SW walking main memory. There are enough
    entries in the table to cover the physical address range. Only page
    faults are handled in SW which should be rare. A page fault is generated after 30 steps in the walk though. That is a miss on 240 possible
    locations. 240-way associativity is not enough?
    The page fault handler will likely just report 'out-of-memory' then quit
    the app.


    Main difference I think was that I ended up mostly settling on modulo addressing rather than hashing, but the reasoning here is more subtle: while on average hashing is better than modulo addressing, the worst
    case behavior for hashing is worse than the worst case for modulo.

    Effectively, hashing requires twice the associativity to not shoot
    itself in the foot in the worst case, whereas modulo is merely slightly worse in the average case.

    I think clustering gives the table more associativity. Up to eight VAs
    can all hash to the same cluster. Almost as if it were eight-way associative.

    I think clustering is key to using an inverted hash-table. I got this
    from studying the PowerPC.


    Ironically, also means that while small-scale alignment is often preferable, medium and large scale alignment is best minimized (or, IOW, ASLR can actually be beneficial for performance by reducing unrelated
    data stomping the same locations in the caches and TLB).


    Though, there are intermediate options, say, for a TLB, XOR'ing the
    index with the ASID for non-global pages, *1; but then this means
    needing to partition the TLB between global and local (halving associativity relative to LUT budget), in turn still favoring the use of modulo addressing.


    Granted, a hashing scheme and multiple probing steps could be possible
    to increase effective associativity at the cost of latency (say, in an extreme case, using latency tricks to cause a 2-way TLB to pretend to be
    4 or 8 way).

    Not too worried about latency as long as it is reasonable. Every step is another eight ways associative due to clustering.

    Be careful !!

    Matrix300 (SPEC89) uses all 4 transposes of DGEMM. The 4th transpose takes
    a TLB miss every 4th Load; or once every 2 cycles on the 6-wide machine.
    A 32-entry fully associative TLB would miss every other cycle, while a 256-entry Direct Mapped TLB would never take a miss.
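    The effect can be reproduced with a toy model (illustrative only, and
    simplified to one page per element): a cyclic walk over 256 pages, as a
    large-stride column traversal produces, thrashes a 32-entry fully
    associative LRU TLB, while a 256-entry direct-mapped TLB indexed by the
    low VPN bits keeps every page in its own set after the cold misses.

```c
/* 256-set direct-mapped TLB: low VPN bits select the set. */
static int dm_misses(int rounds)
{
    long set[256];
    int miss = 0;
    for (int i = 0; i < 256; i++) set[i] = -1;
    for (int r = 0; r < rounds; r++)
        for (long vpn = 0; vpn < 256; vpn++)
            if (set[vpn & 255] != vpn) { miss++; set[vpn & 255] = vpn; }
    return miss;
}

/* 32-entry fully associative TLB with LRU replacement. */
static int fa_misses(int rounds)
{
    long ent[32]; int age[32];
    int miss = 0, t = 0;
    for (int i = 0; i < 32; i++) { ent[i] = -1; age[i] = 0; }
    for (int r = 0; r < rounds; r++)
        for (long vpn = 0; vpn < 256; vpn++) {
            int hit = -1, lru = 0;
            for (int i = 0; i < 32; i++) {
                if (ent[i] == vpn) hit = i;
                if (age[i] < age[lru]) lru = i;
            }
            if (hit >= 0) age[hit] = ++t;                    /* hit  */
            else { miss++; ent[lru] = vpn; age[lru] = ++t; } /* miss */
        }
    return miss;
}
```

    Over four passes the direct-mapped model takes only the 256 cold
    misses, while the LRU model misses on every single access: the working
    set of 256 pages cycled in order defeats any LRU structure smaller
    than the cycle.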

  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sat Jan 31 12:31:36 2026
    From Newsgroup: comp.arch

    Robert Finch wrote:
    On 2026-01-30 3:35 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-01-30 6:16 a.m., Dan Cross wrote:
    In article <10lfuka$1d2c4$1@dont-email.me>,
    Robert Finch <robfi680@gmail.com> wrote:
    Have inverted page tables fallen out of favor? And how much OS support
    is there for them?

    IVTs are a terrible idea. Consider the challenge of using them
    to share memory between virtual address spaces in a
    multiprocessing system, efficiently.

    - Dan C.

    I may have used misleading terminology, the IVT I refer to is a hash
    table based one. I tend to think of them as the same thing. I do not
    think anybody would use a plain IVT.

    <response to earlier comments>
    Is the entire VAS covered by the hierarchical page table system?

    With the entire PAS covered by the page table in BRAM it can be walked
    in hardware very quickly, one cycle per step as opposed to walking the
    page table in DRAM which could be quite slow.

    Process switch is handled by including an ASID in the mapping as for
    a TLB.

    For the IVT implementation the table is twice the size needed to cover
    the PAS to allow for shared memory pages.


    The table just stores mappings VPN -> PPN so I am not quite following
    the challenge to using them for shared memory? Multiple VPN-> to the
    same PPN are possible. Is it the freeing up of physical pages in SW that
    cause an issue?

    Consider mmap(file) in multiple different processes at different VAs.
    So, now one has multiple VA pages pointing at one Physical page.

    ??? I think a hash table has this characteristic. Multiple VA pages can point to a single physical page using a hash table. Is it just a performance issue? ASID is part of the hash/compare.

    Be careful about ASID's - they aren't just part of the virtual address
    in MMU's like x86. There is a logic aspect in the address compare.
    If the PTE Global flag is set then ASID is NOT included in the VA compare,
    if the PTE Global flag is clear then ASID is included in the VA compare.

    If the G flag is set on OS kernel pages then the effect is to make the
    kernel virtual pages one single address space on all cores with no
    associated ASID, while process virtual spaces reside in their own ASID's.

    The above is easy to implement in a CAM because the Global flag is
    part of the PTE entry. Compare the virtual page number (VPN) and if
    there is a PTE match, then if the PTE.G flag is set take ASID as matched, otherwise check if ASID is matched.

    But if you just tack the ASID onto the VA to get a larger VA and
    use that as an index then you don't get the same logic because
    the PTE.G flag that tells you whether ASID matters for that PTE,
    is on the PTE that you are looking up.
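    The CAM compare described above can be sketched in C (a hypothetical
    entry layout; the field names are illustrative, not from any particular
    MMU):

```c
#include <stdint.h>

/* Hypothetical TLB/PTE entry; field names are illustrative only. */
typedef struct {
    uint64_t vpn;     /* virtual page number tag                     */
    uint16_t asid;    /* address-space ID this entry belongs to      */
    int      global;  /* PTE.G: ASID is ignored in the compare if set */
} tlb_entry;

/* CAM-style compare: the VPN must always match; the ASID only
   matters when the entry's Global flag is clear. */
static int tlb_match(const tlb_entry *e, uint64_t vpn, uint16_t asid)
{
    if (e->vpn != vpn)
        return 0;
    return e->global || e->asid == asid;
}
```

    The point EricP makes is that the G flag lives *inside* the entry, so
    this test can only be done after the entry is found, which is why
    folding the ASID into the index loses the Global semantics.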

    I guess I should take a look at the mmap() code.

    The hash table is only slightly different than a giant TLB. The TLB is walked on a miss instead of walking main memory. Page faults would be handled the same way.

    Yes.
    Long ago I read a paper about a port of Linux to a PowerPC with its IVT.
    Linux memory management is based on a hierarchical page table (HPT) so
    rather than rewrite the memory manager they treated the IVT as a large TLB.
    As pages were touched they copied the PTE's from the HPT into IVT.
    When there was a process switch they purged the IVT of all entries
    the same way one might purge a TLB. Later they added ASID's so they
    didn't have to purge IVT on address space switch.

    The code that copied the PTE's from HPT to IVT was very similar
    to a SW managed TLB, as VM Virtual Machines also do if they don't
    have nested page tables.

    Also you can't guarantee that your whole program will fit into a single
    inverted/hash table of a predetermined size because you don't know in
    advance which entries will collide and push other entries out.
    Even if it does fit today, any change to the program could break that.


    The table is clustered, the hash points to eight entries which are then searched in parallel for the correct one. If not found a table walk
    begins using quadratic open addressing.
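    A software sketch of that clustered lookup, under assumed parameters
    (the cluster count and hash below are placeholders, not the actual
    design; only the 8-way clusters and the probe-then-fault shape follow
    the description):

```c
#include <stdint.h>
#include <stddef.h>

#define NCLUSTERS 4096   /* illustrative; the real table has 32K entries */
#define WAYS      8      /* one cluster = 8 entries searched in parallel */
#define MAX_PROBE 30     /* fault after 30 clusters = 240 entries        */

typedef struct {
    uint16_t asid;
    uint64_t vpn;
    uint64_t ppn;
    int      valid;
} pte;

static pte table[NCLUSTERS][WAYS];

/* Placeholder hash; the actual design XORs the ASID with VA bits. */
static unsigned ivt_hash(uint16_t asid, uint64_t vpn)
{
    return (unsigned)((vpn ^ asid) % NCLUSTERS);
}

/* Quadratic open addressing over clusters: h, h+1, h+4, h+9, ... */
static pte *ivt_lookup(uint16_t asid, uint64_t vpn)
{
    unsigned h = ivt_hash(asid, vpn);
    for (unsigned i = 0; i < MAX_PROBE; i++) {
        pte *cl = table[(h + i * i) % NCLUSTERS];
        for (int w = 0; w < WAYS; w++)
            if (cl[w].valid && cl[w].asid == asid && cl[w].vpn == vpn)
                return &cl[w];
    }
    return NULL;   /* miss after 240 locations -> page fault */
}

/* Collisions never displace entries; they bounce to a later cluster. */
static int ivt_insert(uint16_t asid, uint64_t vpn, uint64_t ppn)
{
    unsigned h = ivt_hash(asid, vpn);
    for (unsigned i = 0; i < MAX_PROBE; i++) {
        pte *cl = table[(h + i * i) % NCLUSTERS];
        for (int w = 0; w < WAYS; w++)
            if (!cl[w].valid) {
                cl[w] = (pte){ asid, vpn, ppn, 1 };
                return 1;
            }
    }
    return 0;      /* no free slot within the probe limit: out-of-memory */
}
```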



    I seem to recall at least one fellow advocating the limited use of
    shared memory, using replication instead of shared libraries for
    instance.

    A hierarchical page table is a better solution, but I was looking for
    something lower cost. My monster 2-level TLB is currently about 9000
    LUTs (I have been working to reduce this). The IVT is about 900 LUTs.

    What is your MMU currently doing that uses 9000 LUTs?

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sat Jan 31 16:29:07 2026
    From Newsgroup: comp.arch

    EricP wrote:
    Robert Finch wrote:
    ??? I think a hash table has this characteristic. Multiple VA pages
    can point to a single physical page using a hash table. Is it just a
    performance issue? ASID is part of the hash/compare.

    Be careful about ASID's - they aren't just part of the virtual address
    in MMU's like x86. There is a logic aspect in the address compare.
    If the PTE Global flag is set then ASID is NOT included in the VA compare,
    if the PTE Global flag is clear then ASID is included in the VA compare.

    If the G flag is set on OS kernel pages then the effect is to make the
    kernel virtual pages one single address space on all cores with no
    associated ASID, while process virtual spaces reside in their own ASID's.

    The above is easy to implement in a CAM because the Global flag is
    part of the PTE entry. Compare the virtual page number (VPN) and if
    there is a PTE match, then if the PTE.G flag is set take ASID as matched, otherwise check if ASID is matched.

    But if you just tack the ASID onto the VA to get a larger VA and
    use that as an index then you don't get the same logic because
    the PTE.G flag that tells you whether ASID matters for that PTE,
    is on the PTE that you are looking up.

    One could achieve the same effect as the CAM approach by picking off the
    range of addresses assigned to the OS and assigning them to ASID = 0,
    and having the process range use the current non-zero ASID register.

    If we have a 64-bit VA with 4kB pages = 52-bit virtual page number
    and we added a 12-bit ASID then that gives a 76-bit virtual space
    with a 64-bit AS-VPN to look up. We reserve ASID=0 for the OS,
    1..4095 are for processes.

    The MMU has a register MMU_OS_START that holds the OS unsigned lower
    start address and a comparator. If the VA is >= MMU_OS_START then
    it uses an ASID = 0, otherwise it comes from the MMU_ASID register.

    This 12-bit value is concatenated with the 52-bit VPN to give a
    64-bit combined ASID-VPN index value to lookup in the TLB.

    The OS software that assigns ASIDs also skips assigning 0 to any process.
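    A C sketch of that selection logic (register names follow the
    description above; the OS base address and current ASID are just
    example values):

```c
#include <stdint.h>

/* Example register contents; real values are set by the OS. */
static uint64_t MMU_OS_START = 0xFFFF800000000000ull; /* OS region base */
static uint16_t MMU_ASID     = 42;                    /* current process */

/* VA >= MMU_OS_START selects the shared OS space (ASID 0);
   otherwise the per-process ASID register is used. */
static uint16_t effective_asid(uint64_t va)
{
    return (va >= MMU_OS_START) ? 0 : MMU_ASID;
}

/* Concatenate the 12-bit ASID with the 52-bit VPN (4 kB pages)
   to form the 64-bit combined ASID-VPN lookup index. */
static uint64_t asid_vpn_index(uint64_t va)
{
    uint64_t vpn = va >> 12;
    return ((uint64_t)effective_asid(va) << 52) | vpn;
}
```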



    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Jan 31 17:03:54 2026
    From Newsgroup: comp.arch

    On 2026-01-31 12:31 p.m., EricP wrote:
    Robert Finch wrote:
    On 2026-01-30 3:35 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-01-30 6:16 a.m., Dan Cross wrote:
    In article <10lfuka$1d2c4$1@dont-email.me>,
    Robert Finch <robfi680@gmail.com> wrote:
    Have inverted page tables fallen out of favor? And how much OS support
    is there for them?

    IVTs are a terrible idea. Consider the challenge of using them
    to share memory between virtual address spaces in a
    multiprocessing system, efficiently.

    - Dan C.

    I may have used misleading terminology, the IVT I refer to is a hash
    table based one. I tend to think of them as the same thing. I do not
    think anybody would use a plain IVT.

    <response to earlier comments>
    Is the entire VAS covered by the hierarchical page table system?

    With the entire PAS covered by the page table in BRAM it can be walked
    in hardware very quickly, one cycle per step as opposed to walking the
    page table in DRAM which could be quite slow.

    Process switch is handled by including an ASID in the mapping as for
    a TLB.

    For the IVT implementation the table is twice the size needed to cover
    the PAS to allow for shared memory pages.


    The table just stores mappings VPN -> PPN so I am not quite following
    the challenge to using them for shared memory? Multiple VPNs mapping to
    the same PPN are possible. Is it the freeing up of physical pages in SW
    that causes an issue?

    Consider mmap(file) in multiple different processes at different VAs.
    So, now one has multiple VA pages pointing at one Physical page.

    ??? I think a hash table has this characteristic. Multiple VA pages
    can point to a single physical page using a hash table. Is it just a
    performance issue? ASID is part of the hash/compare.

    Be careful about ASID's - they aren't just part of the virtual address
    in MMU's like x86. There is a logic aspect in the address compare.
    If the PTE Global flag is set then ASID is NOT included in the VA compare,
    if the PTE Global flag is clear then ASID is included in the VA compare.

    If the G flag is set on OS kernel pages then the effect is to make the
    kernel virtual pages one single address space on all cores with no
    associated ASID, while process virtual spaces reside in their own ASID's.

    A global bit is not being used. The address space is divided in two by
    the MSB. The kernel space has the MSB set and is unmapped. Previously I
    used an ASID of zero to represent the kernel address space. The ASID is
    not being used as an index.

    The above is easy to implement in a CAM because the Global flag is
    part of the PTE entry. Compare the virtual page number (VPN) and if
    there is a PTE match, then if the PTE.G flag is set take ASID as matched, otherwise check if ASID is matched.

    But if you just tack the ASID onto the VA to get a larger VA and
    use that as an index then you don't get the same logic because
    the PTE.G flag that tells you whether ASID matters for that PTE,
    is on the PTE that you are looking up.

    I guess I should take a look at the mmap() code.

    The hash table is only slightly different than a giant TLB. The TLB is
    walked on a miss instead of walking main memory. Page faults would be
    handled the same way.

    Yes.
    Long ago I read a paper about a port of Linux to a PowerPC with its IVT.
    Linux memory management is based on a hierarchical page table (HPT) so
    rather than rewrite the memory manager they treated the IVT as a large TLB.
    As pages were touched they copied the PTE's from the HPT into IVT.
    When there was a process switch they purged the IVT of all entries
    the same way one might purge a TLB. Later they added ASID's so they
    didn't have to purge IVT on address space switch.

    The code that copied the PTE's from HPT to IVT was very similar
    to a SW managed TLB, as VM Virtual Machines also do if they don't
    have nested page tables.

    I am not sure about getting Linux working. I may just use my own OS.
    Skimming through the mmap() code has given me an idea of just how
    complex the software is.

    Also you can't guarantee that your whole program will fit into a single
    inverted/hash table of a predetermined size because you don't know in
    advance which entries will collide and push other entries out.
    Even if it does fit today, any change to the program could break that.

    Collisions will not push other entries out, they just bounce to a
    different empty entry. If one cannot be found then it is an out-of-memory.

    Testing some. Testing just allocates pages without freeing them so it is
    a bit unrealistic.
    The IVT stopped after 63 bounces at 69% full. (A max of 64 bounces is
    set). Average number of bounces: 1.8 per lookup. This was with random
    ASIDs and addresses.
    The table can be clocked at twice the CPU rate.

    I have been wondering how to collect and use stats to control things.

    The entries in the IVT for ASIDs no longer in use could be cleared automatically by a background HW process.


    The table is clustered, the hash points to eight entries which are
    then searched in parallel for the correct one. If not found a table
    walk begins using quadratic open addressing.



    I seem to recall at least one fellow advocating the limited use of
    shared memory, using replication instead of shared libraries for
    instance.

    A hierarchical page table is a better solution, but I was looking for
    something lower cost. My monster 2-level TLB is currently about 9000
    LUTs (I have been working to reduce this). The IVT is about 900 LUTs.

    What is your MMU currently doing that uses 9000 LUTs?

    I tried to over build it originally. It currently has a triple ported L2
    TLB so that L2 TLB entries can be used directly to form addresses on a
    miss. This trims at least one clock cycle off the miss time for L1. The
    L1 TLB is fully associative and triple ported as well. L2 is also split
    into two TLBs one for regular pages and a second TLB for huge pages.
    There is a triple-ported miss queue. The TLB hit logic is a bit scary.
    Then there is the logic to update the accessed and modified flags of the
    TLB entry.

    I have been working on the TLB to make L2 single ported and the miss
    queue single ported as well. That should reduce both the LUT cost and
    BRAM cost. It will reduce performance a little bit if there are
    concurrent misses happening.


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From kegs@kegs@provalid.com (Kent Dickey) to comp.arch on Sun Feb 1 15:28:47 2026
    From Newsgroup: comp.arch

    In article <10lji2u$2haq2$1@dont-email.me>,
    Robert Finch <robfi680@gmail.com> wrote:
    On 2026-01-30 4:44 p.m., BGB wrote:
    On 1/30/2026 3:03 PM, Robert Finch wrote:

    [ Proposal is a hashed page table, with 2x the entries to cover the entire physical address space, with each hash entry having 8 pages which need to
    be searched before moving to the next entry].

    It isn't obvious (at least to me) how what you are describing is much
    different from a SW managed TLB.

    It is not a lot different. On a translation miss the table is walked
    with HW though instead of SW walking main memory. There are enough
    entries in the table to cover the physical address range. Only page
    faults are handled in SW which should be rare. A page fault is generated
    after 30 steps in the walk though. That is a miss on 240 possible
    locations. 240-way associativity is not enough?
    The page fault handler will likely just report 'out-of-memory' then quit
    the app.

    I'm going to guess your physical address space is just 32-bits, otherwise
    this table is huge. I think you said you have 64KB pages (but then you said 256KB pages--which is just too big, so I'll ignore that), which for a
    44-bit physical address space would mean 2^28 entries, with each entry
    being 128-bits, for a 4GB page table. But for 32-bit, it's just 2^16
    entries of 64-bits each.

    A page table which doesn't scale is problematic. Especially if it needs
    OS support. Who wants to invest effort in a dead-end idea?

    Your hashed page table has issues which I think you should address:

    1) Supporting multiple page sizes is a pain. If you have 2MB pages and
    64KB pages, there are 32 64KB pages for each 2MB page. So do you
    lazily load the hashtable only on misses, or load all 32 aliases
    at once?

    2) It's common to just do a single hash lookup, then trap to software. Doing
    probing is a big pain. The problems come from removing entries.
    If you page out a page, it leaves gaps in the table. When you do
    a hardware lookup, and find no valid pages at the first entry, this
    doesn't tell you anything! The obvious solution is ALL hardware
    lookups must do the maximum searching before declaring a hash-table
    miss--which you say is 30 steps, checking 240 entries. I'm not sure
    there's a better solution than that, but this is a pretty big
    penalty for a TLB miss.

    3) It's very complex to efficiently update the hash table. On inserting
    a new entry, where do you put it? On one hand, a "move-to-front"
    strategy seems like a good idea--newly added pages seem like they
    should get lookup priority. But how do you do this? It looks like
    each software handled miss has a TON of work to do. It seems like
    you need to re-hash each entry you consider moving, to make sure it
    stays on its proper hash chain. And keeping the OS page structures
    in sync looks like even more work.

    4) The global nature of the hash table means multiple threads probably need
    a lock to change it. So misses to software have to get a lock before
    making any hashtable changes. This is very expensive. This is
    especially bad when a process ends--all of its translations need to
    be removed, and so all page table misses are blocked while that
    happens. This is bad.

    A simpler approach is a per-process hash table of variable size in memory, sized to at least 2X the number of pages being used. It's indexed by a hash
    of ASID+VA, and does a single probe. It might be useful to do two hash
    tables: one for large pages, and one for small pages (with 2 lookups).
    On any miss, the hash entry to be updated is easy to find for software.
    But: there's effectively no associativity--but just increase the hash table's size if software detects that you are getting conflicts. Pick a good
    hash function to avoid stupid conflicts. I recommend you make the hash function an instruction, so software can calculate it easily (so if you
    reverse bits and do other stuff that may be hard for software to calculate,
    you make it easy).
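    As a sketch, a single-probe index computation along those lines might
    look like this in C (the mixer and the ASID/VPN packing are
    illustrative choices on my part, not a proposal for the actual hash
    instruction Kent suggests):

```c
#include <stdint.h>

/* Illustrative 64-bit finalizer-style mixer (MurmurHash3 fmix64
   constants); invertible, so distinct inputs always mix to distinct
   outputs. */
static uint64_t mix(uint64_t x)
{
    x ^= x >> 33; x *= 0xff51afd7ed558ccdull;
    x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ull;
    x ^= x >> 33;
    return x;
}

/* Single-probe index into a per-process hash table: hash of ASID+VA,
   masked down assuming nslots is a power of two sized to at least 2x
   the resident pages. If this same function were exposed as an
   instruction, the software miss handler could compute the slot
   trivially. */
static uint64_t hpt_index(uint16_t asid, uint64_t va,
                          unsigned page_shift, uint64_t nslots)
{
    uint64_t vpn = va >> page_shift;
    return mix(((uint64_t)asid << 52) | vpn) & (nslots - 1);
}
```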

    This gives you the benefits of a hardware table walker with less trouble.

    An advantage of hashed page tables is the TLB itself has no "intermediate" entries. This lets the primary TLB hold more useful translations. This
    is not much of an issue if the primary TLB is >= 256 entries, but it's
    really noticeable if the TLB is <= 32 entries. You need to lean in to
    hashed page tables' advantages.

    Kent
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sun Feb 1 12:49:53 2026
    From Newsgroup: comp.arch

    Robert Finch wrote:
    On 2026-01-31 12:31 p.m., EricP wrote:
    Robert Finch wrote:

    The hash table is only slightly different than a giant TLB. The TLB
    is walked on a miss instead of walking main memory. Page faults would
    be handled the same way.

    Yes.
    Long ago I read a paper about a port of Linux to a PowerPC with its IVT.
    Linux memory management is based on a hierarchical page table (HPT) so
    rather that rewrite the memory manager they treated the IVT as a large
    TLB.
    As pages were touched they copied the PTE's from the HPT into IVT.
    When there was a process switch they purged the IVT of all entries
    the same way one might purge a TLB. Later they added ASID's so they
    didn't have to purge IVT on address space switch.

    The code that copied the PTE's from HPT to IVT was very similar
    to a SW managed TLB, as VM Virtual Machines also do if they don't
    have nested page tables.

    I am not sure about getting Linux working. I may just use my own OS.
    Skimming through the mmap() code has given me an idea of just how
    complex the software is.

    It was just an example of how one can use a page table to emulate
    any other kind of page table.

    Also you can't guarantee that your whole program will fit into a single
    inverted/hash table of a predetermined size because you don't know in
    advance which entries will collide and push other entries out.
    Even if it does fit today, any change to the program could break that.

    Collisions will not push other entries out, they just bounce to a
    different empty entry. If one cannot be found then it is an out-of-memory.

    Testing some. Testing just allocates pages without freeing them so it is
    a bit unrealistic.
    The IVT stopped after 63 bounces at 69% full. (A max of 64 bounces is
    set). Average number of bounces: 1.8 per lookup. This was with random
    ASIDs and addresses.
    The table can be clocked at twice the CPU rate.

    One might make the hash function controllable so you can optimize the
    bounce length for different programs. Also the max bounce limit should
    be a dynamic HW register and not hard wired.

    I have been wondering how to collect and use stats to control things.

    The entries in the IVT for ASIDs no longer in use could be cleared automatically by a background HW process.

    That implies some kind of ASID allocator to track which are in-use.
    It also implies a whole IVT scan as each ASID is freed,
    or maintain a pending free list and a scavenger process.

    Simpler to have process creation assign a new ASID by incrementing
    an unsigned up counter that starts at 1. When you hit its max then reset
    to 1 and clear the whole IVT for all non-zero ASID entries in one pass.
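    That counter scheme might look like the following sketch (the purge
    hook is a stub standing in for the hardware sweep of the IVT; 12-bit
    ASIDs with 0 reserved for the OS, as discussed earlier):

```c
#include <stdint.h>

#define ASID_MAX 4095        /* 12-bit ASIDs; 0 reserved for the OS */

static uint16_t next_asid = 1;
static int      purges;     /* counts wrap-around sweeps (for the stub) */

/* Stub for the hardware/OS pass that clears every non-zero-ASID
   entry in the IVT. */
static void ivt_purge_nonzero(void) { purges++; }

/* Hand out ASIDs from an up-counter starting at 1; on hitting the
   max, purge all process translations once and restart at 1. */
static uint16_t alloc_asid(void)
{
    if (next_asid > ASID_MAX) {
        ivt_purge_nonzero();
        next_asid = 1;
    }
    return next_asid++;
}
```

    This trades an occasional full-table sweep for not having to track
    per-ASID liveness or run a scavenger.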


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Feb 1 16:12:34 2026
    From Newsgroup: comp.arch

    On 1/30/2026 7:36 PM, MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-01-30 4:44 p.m., BGB wrote:
    On 1/30/2026 3:03 PM, Robert Finch wrote:
    On 2026-01-30 3:35 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-01-30 6:16 a.m., Dan Cross wrote:
    In article <10lfuka$1d2c4$1@dont-email.me>,
    Robert Finch <robfi680@gmail.com> wrote:
    Have inverted page tables fallen out of favor? And how much OS support
    is there for them?

    IVTs are a terrible idea. Consider the challenge of using them
    to share memory between virtual address spaces in a
    multiprocessing system, efficiently.

    - Dan C.

    I may have used misleading terminology, the IVT I refer to is a hash
    table based one. I tend to think of them as the same thing. I do not
    think anybody would use a plain IVT.

    <response to earlier comments>
    Is the entire VAS covered by the hierarchical page table system?

    With the entire PAS covered by the page table in BRAM it can be walked
    in hardware very quickly, one cycle per step as opposed to walking the
    page table in DRAM which could be quite slow.

    Process switch is handled by including an ASID in the mapping as for
    a TLB.

    For the IVT implementation the table is twice the size needed to cover
    the PAS to allow for shared memory pages.

    The table just stores mappings VPN -> PPN so I am not quite following
    the challenge to using them for shared memory? Multiple VPNs mapping to
    the same PPN are possible. Is it the freeing up of physical pages in SW
    that causes an issue?

    Consider mmap(file) in multiple different processes at different VAs.
    So, now one has multiple VA pages pointing at one Physical page.

    ??? I think a hash table has this characteristic. Multiple VA pages
    can point to a single physical page using a hash table. Is it just a
    performance issue? ASID is part of the hash/compare.

    I guess I should take a look at the mmap() code.

    The hash table is only slightly different than a giant TLB. The TLB is
    walked on a miss instead of walking main memory. Page faults would be
    handled the same way.

    The table is clustered, the hash points to eight entries which are
    then searched in parallel for the correct one. If not found a table
    walk begins using quadratic open addressing.


    It isn't obvious (at least to me) how what you are describing is much
    different from a SW managed TLB.

    It is not a lot different. On a translation miss the table is walked
    with HW though instead of SW walking main memory. There are enough
    entries in the table to cover the physical address range. Only page
    faults are handled in SW which should be rare. A page fault is generated
    after 30 steps in the walk though. That is a miss on 240 possible
    locations. 240-way associativity is not enough?
    The page fault handler will likely just report 'out-of-memory' then quit
    the app.


    Errm, now I am feeling uncertain how viable implementing this would be...


    Sometimes, it ends up being better to implement the minimum complexity mechanism that can work, and only implement a more complex mechanism if
    there is a significant drawback or inefficiency with the simpler
    mechanism that the more complex mechanism could address.


    Like, I can note my use of a custom RP2 compression scheme:
    It is a byte oriented LZ variant:
    Typically slightly better than LZ4 for general data;
    Typically slightly worse than LZ4 for program binaries.


    I have one use-case where I use RP2:
    For compressing voxel chunks in my BT3 3D engine.

    Each chunk is 16*16*16 blocks, with one of several in-RAM storage schemes:
    Single block type (no index);
    2-16 block types, 4-bit index, 2K/chunk
    17-256 block types, 8 bit index, 4K/chunk
    257-4096: no index, 32K / chunk.

    So, for storage, it flattens out the chunk:
    Simple header;
    List of block types;
    Index data.

    Then RP2 compresses the chunk.

    The previous engine (BT2) had used a LZ+STF+AdRice scheme here, but this
    is slower than a byte-oriented scheme (like RP2 or LZ4), and the
    compression delta is small. So, BT3 originally went with RP2 mostly as
    it was faster.

    Technically, either LZ4 or RP2 could be used, but RP2 notably beats LZ4
    here, as the smallest run encoding ends up used heavily:
    dddddddd-dlllrrr0:
    0-7 literal bytes, 3-10 matched bytes, distance=0..511
    A distance of 0 is invalid, can be used as an escape code.
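    One plausible reading of that two-byte token, decoded in C (the exact
    bit order is my interpretation of the "dddddddd-dlllrrr0" notation read
    MSB-to-LSB, not RP2's actual wire format):

```c
#include <stdint.h>

/* Token type check: the low tag bit of the smallest run encoding is 0. */
static int rp2_is_short(uint16_t t)  { return (t & 1) == 0; }

/* rrr: raw 0..7, encoding 3..10 matched bytes. */
static unsigned rp2_nmatch(uint16_t t) { return ((t >> 1) & 7) + 3; }

/* lll: 0..7 literal bytes preceding the match. */
static unsigned rp2_nlit(uint16_t t)   { return (t >> 4) & 7; }

/* ddddddddd: 9 bits, distance 0..511; 0 is invalid and usable as an
   escape code, per the description above. */
static unsigned rp2_dist(uint16_t t)   { return t >> 7; }
```

    Usage: a token assembled as `(5 << 7) | (2 << 4) | (1 << 1)` would
    decode to distance 5, 2 literals, and a 4-byte match under this
    reading.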

    The next match size up is 3 bytes: 4-67 match bytes, vs an 8K distance.
    This seems to be the main difference with program binaries, which have
    more of a skew towards short, long-distance matches, which favor LZ4.

    RP2 though has a stronger advantage when the input data is under 8K.
    Though, it is possible that a 35 byte match over 16K might have been
    better on average (possibly with another match type for "long match at
    short distance", or RLE like patterns).



    I recently evaluated a possible trick of bolting on STF+AdRice after the
    fact (on top of RP2) as a sort of post-compressor entropy stage (jank,
    but limits computational complexity on the encoder side).

    Had then noted a pattern (based on the RP2-compressed size):
    Under 512 bytes: Entropy stage typically makes it bigger.
    This was the dominant size class here.
    1-2K: Entropy helps slightly, but fails to cross a "useful threshold".
    4K+: Mostly unseen.

    Then went and grabbed a bitwise range coder, and tried using this as a
    test, along with something resembling a "mini-LZMA" (LZ + bitwise range
    coder, along similar lines to LZMA).

    Result, none crossed the threshold...
    The threshold: Give an additional 20% size reduction over the use of RP2.

    Might at least get used sometimes if the threshold were 5% or 10%, but
    don't really want to spend the comparably steep cost of an entropy coder
    for a merely 5% or 10% gain.

    Why not Huffman (or Deflate?):
    Because, ironically, for small highly compressible data, Deflate
    actually turns up being worse off here than RP2.

    Here, one wants the entropy-coded compressors to be more, "break glass
    in case of compression emergency".



    Though, that said, bolting an LZMA style range coder onto RP2 would
    still appear to be a moderately strong possibility (in terms of
    compression), even if bitwise range coding is almost impractically slow
    (on a desktop PC, one is hard pressed to push more than around 50 MB/sec through a range coder).


    Though, that said, there is sometimes something in my mind that seems to
    be trying to yell out that the LZMA range coder design holds a clue for greatly increasing the efficiency and learning rate of neural nets (but
    the rest of my mind can't figure out how to combine them; and both
    genetic algorithms and back-propagation leave something to be desired...).

    If anything, one could try to go the other way and use a neural-net as a predictive model for a range-coder, but then one would merely be burning
    CPU time.

    Though, did note a phenomenon with backprop neural nets where they were
    using weakly-biased neurons as a sort of memory between training steps
    (in the case of repeating test cases, it would "memorize" the answer and
    then use it to cheat on the following run when given the same inputs).
    Which is in some sense analogous to the normal behavioral pattern in a range-coder.

    Or, basically, the training process was working like "flash cards" and
    when it would get an answer wrong, the training algo would note it and
    then present the same inputs again a few passes later. The NN was then
    effectively remembering the inputs and the output adjustment and repeating
    them again (without actually learning anything), and the net seemingly
    got pretty good at this (not learning the actual task, just sort of
    optimizing itself to be able to cheat at the NN training algorithm's use
    of repetition on wrong answers).


    Where, in my NN experiments, it would appear that near-0 behavior for
    weights may be more relevant than previously expected (and handling them
    in a similar way to DEC/PDP style floating-point may not be ideal). Even
    if it mostly reflects a use-case that is an adverse edge case in NN
    training (but, in a dynamically adapted net, could potentially be a useful/desirable property; working as a sort of working memory; well,
    along with feeding part of the activation outputs back into the inputs,
    etc).

    Well, also in range-coder predictors, the relative adjustment strength
    depends on which direction one is moving and where one is in the space.
    If near 0, then movements away from 0 are rapid, but as one gets
    further away, it gets harder to form an increment (at the upper
    extremes, it becoming impossible to move further). The closer one is
    towards 0, the harder it is to approach 0. Though this is likely an
    artifact of the distortion of what linear adjustments look like within a non-linear space.
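    For concreteness, the LZMA-style predictor update being described can
    be written as (11-bit probability of a 0 bit, 1/32 adaptation step, as
    in LZMA's bit model):

```c
#include <stdint.h>

#define PROB_ONE  (1u << 11)   /* 2048: probability scale           */
#define MOVE_BITS 5            /* step = 1/32 of the remaining range */

/* Adaptive estimate of P(bit == 0). The step toward an extreme
   shrinks as the estimate approaches it: near 2048 a 0-bit barely
   moves it, and near 0 a 1-bit step can round to zero, i.e. the
   asymmetry described above. */
static uint16_t prob_update(uint16_t p, int bit)
{
    if (bit == 0)
        return p + ((PROB_ONE - p) >> MOVE_BITS);
    else
        return p - (p >> MOVE_BITS);
}
```

    Note how at p = 31 a 1-bit moves the estimate by 31 >> 5 = 0: the
    closer to the boundary, the harder further movement becomes, which is
    the non-linear behavior the paragraph above is pointing at.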

    Comparably, the curve of FP values is steeper than that of RC predictors (exponential rather than square).

    Well, some of this could imply using some lookup-tables for weight
    adjustment in the backprop stage (similar to those used for an
    range-coder). Might also need to consider using IEEE style denormals for
    FP8.

    Still feels as if I am missing something.


    Well, there is still the issue that I needed to run the NN training
    using FP16 weights internally, as directly training the FP8 weights
    wasn't stable enough. So, part of the issue is figuring out stable live adaptation on FP8 weights (which does kinda resemble the range-coder scenario). Well, and that (post backprop) the weight-update in this case
    is mostly XOR'ing the sign-bits to figure out which direction to adjust
    the weight (likewise, also resembling the RC update scenario), though
    skipping weight adjustment when the error is 0 (and skipping downward adjustment when the weight is already 0, etc).


    Issue though seems to be that while FP8 works well enough for forward inference, the direct jumps between FP8 values are too large for
    backprop to work with effectively, and it essentially ends up needing
    some sub-ULP bits for finer adjustment (leading to the issue of using
    FP16 for the backprop stage; which would limit practicality for a live adaptation net).

    Also, would need to figure out a good way to replace "correct/incorrect" training with something more like "pain/no-pain" adaptation, but with
    "pain driven adaptation" one is left merely to do backprop using the
    inverse of the previous outputs as the error vector, which is far less effective (and not clearly much better than just using genetic
    algorithms as the training process; except that GAs limit scenarios a
    lot more than the use of positive and negative reinforcement as a
    training algorithm).

    But, the downside of positive/negative reinforcement is getting
    something much beyond simplistic insect-like behaviors.

    ...



    Main difference I think was that I ended up mostly settling on modulo
    addressing rather than hashing, but the reasoning here is more subtle:
    while on average hashing is better than modulo addressing, the worst
    case behavior for hashing is worse than the worst case for modulo.

    Effectively, hashing requires twice the associativity to not shoot
    itself in the foot in the worst case, whereas modulo is merely slightly
    worse in the average case.

    I think clustering gives the table more associativity. Up to eight VAs
    can all hash to the same cluster. Almost as if it were eight-way
    associative.

    I think clustering is key to using an inverted hash-table. I got this
    from studying the PowerPC.


    Ironically, also means that while small-scale alignment is often
    preferable, medium and large scale alignment is best minimized (or, IOW,
    ASLR can actually be beneficial for performance by reducing unrelated
    data stomping the same locations in the caches and TLB).


    Though, there are intermediate options, say, for a TLB, XOR'ing the
    index with the ASID for non-global pages, *1; but then this means
    needing to partition the TLB between global and local (halving
associativity relative to LUT budget), in turn still favoring the use of modulo addressing.


    Granted, a hashing scheme and multiple probing steps could be possible
    to increase effective associativity at the cost of latency (say, in an
extreme case, using latency tricks to cause a 2-way TLB to pretend to be 4 or 8 way).

Not too worried about latency as long as it is reasonable. Every step adds
another eight ways of associativity due to clustering.

    Be careful !!

    Matrix300 (SPEC89) uses all 4 transposes of DGEMM. The 4th transpose takes
    a TLB miss every 4th Load; or once every 2-cycles on the 6-wide machine.
    A 32-entry fully associative TLB would miss every other cycle, while a 256-entry Direct Mapped TLB would never take a miss.


    In my own testing, it seemed for the main (L2) TLB, assuming modulo
    indexing and SW refill:
    1-way: doesn't work for L2 TLB (works well for an L1 TLB though);
    2-way: bare minimum, but can fail in certain edge cases (*1);
    4-way: Works pretty well;
    8-way: "near perfect" on the associativity front.

    In many contexts, there doesn't seem to be much gain in going over 8-way associativity, and only a modest gain vs 4-way.

    *1: There is a risk for cases where both D$ and I$ miss and both cross
    page boundaries in the access. While this is less of an issue with
    modulo addressing than with hashing (if this happens with hashed
    indexing and 2-way associativity, one is SOL). It seems to squeeze
    through, if barely, with modulo indexing (because no two adjacent pages
    can stomp the same index).
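A quick check of that footnote's claim (sizes here are arbitrary, and the hash is just an example): with power-of-two modulo indexing two adjacent pages can never land on the same set, whereas a hashed index offers no such guarantee.

```python
SETS = 64
PAGE_SHIFT = 16

def modulo_index(va):
    return (va >> PAGE_SHIFT) % SETS

def hashed_index(va):
    vpn = va >> PAGE_SHIFT
    return (vpn ^ (vpn >> 6)) % SETS   # arbitrary example hash

page_a = 0x100000
page_b = page_a + (1 << PAGE_SHIFT)    # the very next page
assert modulo_index(page_a) != modulo_index(page_b)   # always distinct

# With hashing, adjacency guarantees nothing: these two neighbours collide.
page_c, page_d = 2047 << PAGE_SHIFT, 2048 << PAGE_SHIFT
assert hashed_index(page_c) == hashed_index(page_d)
```

So a simultaneous D$/I$ miss with both accesses crossing a page boundary can need four distinct translations, which modulo indexing spreads over at least two sets but a hash may pile onto one.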


    As noted, hashing is a risk here:
    Makes average better at expense of worst case;
    Effectively makes 4-way the bare minimum;
    Paradoxically, while making the average case better, it increases the associativity required to avoid performance loss in adverse cases (hash collisions).

    Some of the same benefits of hashing can also be realized in the OS via strategic use of ASLR (where ASLR tends to benefit TLB performance in
    this case).


    Can note that 8-way is an improvement over 4-way, but the gains are
    comparably smaller. Likewise, 16-way seems to gain very little over 8-way.

    Once the required associativity is present, more benefit is seen mostly
    by making the TLB bigger (if possible, you really want the working set
    to be able to fit into the TLB).


    But, as noted, the "Last Level Page-Table TLB" scheme I had considered previously could greatly expand the working set of the MMU without the
    cost of a full page walker.

    But, more a case of investing resources into fixing an issue that is not
    (yet) a significant performance issue.

    Granted, can note that both my BT3 engine and Quake 3 Arena tend to
    exceed the working-set size of the MMU, so are prone to a higher
frequency of TLB miss interrupts. But, fixing this still won't make
    Quake3 all that playable on a 50MHz CPU.

    ...



    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Tue Feb 3 01:03:36 2026
    From Newsgroup: comp.arch

After getting bogged down testing the hash table, I decided to shelve it
    for now. And...

    I re-wrote Qupls4 TLB for hierarchical page tables. It is much cleaner
    code now. The TLB is taking up only about 600 LUTs and 8 BRAMs now. It
is also simpler, being only a single-level TLB. Three copies of the TLB
    would be needed.

    Thinking about adding caching of translation steps to avoid main memory access.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Tue Feb 3 01:23:05 2026
    From Newsgroup: comp.arch

    On 2026-02-01 12:49 p.m., EricP wrote:
    Robert Finch wrote:
    On 2026-01-31 12:31 p.m., EricP wrote:
    Robert Finch wrote:

    The hash table is only slightly different than a giant TLB. The TLB
    is walked on a miss instead of walking main memory. Page faults
    would be handled the same way.

    Yes.
Long ago I read a paper about a port of Linux to a PowerPC with its IVT.
Linux memory management is based on a hierarchical page table (HPT) so
rather than rewrite the memory manager they treated the IVT as a
    large TLB.
    As pages were touched they copied the PTE's from the HPT into IVT.
    When there was a process switch they purged the IVT of all entries
    the same way one might purge a TLB. Later they added ASID's so they
    didn't have to purge IVT on address space switch.

    The code that copied the PTE's from HPT to IVT was very similar
    to a SW managed TLB, as VM Virtual Machines also do if they don't
    have nested page tables.

    I am not sure about getting Linux working. I may just use my own OS.
    Skimming through the mmap() code has given me an idea of just how
    complex the software is.

    It was just an example of how one can use a page table to emulate
    any other kind of page table.

Also you can't guarantee that your whole program will fit into a single inverted/hash table of a predetermined size because you don't know in
    advance which entries will collide and push other entries out.
    Even if it does fit today, any change to the program could break that.

    Collisions will not push other entries out, they just bounce to a
different empty entry. If one cannot be found then it is an out-of-memory
condition.

    Testing some. Testing just allocates pages without freeing them so it
    is a bit unrealistic.
    The IVT stopped after 63 bounces at 69% full. (A max of 64 bounces is
    set). Average number of bounces: 1.8 per lookup. This was with random
    ASIDs and addresses.
    The table can be clocked at twice the CPU rate.
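A rough rerun of the experiment described above: insert random translations into a clustered table, bouncing to another group on collision, until some insert exceeds the bounce limit. The rehash here is plain linear group probing and the sizes are guesses, so the exact fill fraction will differ from the 69% reported.

```python
import random

NUM_GROUPS, GROUP_SIZE, MAX_BOUNCES = 1024, 8, 64
table = [0] * NUM_GROUPS        # occupancy count per group is enough here

def insert(key):
    g = key % NUM_GROUPS
    for bounce in range(MAX_BOUNCES):
        if table[g] < GROUP_SIZE:
            table[g] += 1
            return bounce
        g = (g + 1) % NUM_GROUPS     # assumed rehash: linear group probing
    return None                       # insert failed -> out of memory

random.seed(1)
inserted = 0
while True:
    if insert(random.getrandbits(32)) is None:
        break
    inserted += 1

fill = inserted / (NUM_GROUPS * GROUP_SIZE)
print(f"table reached {fill:.0%} full before a {MAX_BOUNCES}-bounce failure")
```

As in the posted test, pages are only allocated and never freed, so this models the pessimistic fill-until-failure case.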

    One might make the hash function controllable so you can optimize the
    bounce length for different programs. Also the max bounce limit should
    be a dynamic HW register and not hard wired.

The max bounce limit is a HW register. Making the hash function
controllable is a bit tricky as it is used by HW. It might be easier to
provide a selection of a few different hashes. The hash function reminds me
of crypto rotates and combinations.


    I have been wondering how to collect and use stats to control things.

    The entries in the IVT for ASIDs no longer in use could be cleared
    automatically by a background HW process.

    That implies some kind of ASID allocator to track which are in-use.
    It also implies a whole IVT scan as each ASID is freed,
    or maintain a pending free list and a scavenger process.

For the hash table a small ASID (e.g. 8 bits) is used. It could be tracked
with a bitmap.

If the ASIDs are not reused too quickly the scavenger may not need to be
fast. An ASID FIFO could be used to allocate / free ASIDs.

Simpler to have process creation assign a new ASID by incrementing
    an unsigned up counter that starts at 1. When you hit its max then reset
    to 1 and clear the whole IVT for all non-zero ASID entries in one pass.
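The wrap-and-flush counter scheme can be sketched as follows (an 8-bit ASID space is assumed here, and the TLB/IVT is modeled as a dict; both are illustrative choices, not anyone's actual design):

```python
ASID_MAX = 255                     # assumed 8-bit ASID space

class AsidAllocator:
    def __init__(self, tlb):
        self.next_asid = 1         # ASID 0 reserved (e.g. for the kernel)
        self.tlb = tlb             # dict: (asid, vpn) -> pfn

    def allocate(self):
        if self.next_asid > ASID_MAX:
            # Wrap: flush everything tagged with a non-zero (process) ASID
            # in one pass, then start handing out ASIDs from 1 again.
            for key in [k for k in self.tlb if k[0] != 0]:
                del self.tlb[key]
            self.next_asid = 1
        asid = self.next_asid
        self.next_asid += 1
        return asid

tlb = {(0, 0x10): 0x1}             # a global/kernel entry under ASID 0
alloc = AsidAllocator(tlb)
for _ in range(ASID_MAX):          # consume the whole ASID space
    a = alloc.allocate()
    tlb[(a, 0x20)] = 0x2
alloc.allocate()                   # triggers the wrap-and-flush
assert (0, 0x10) in tlb            # the ASID-0 entry survives the flush
```

The appeal is that the expensive full-table pass happens only once per ASID-space wrap, which is rare if the ASID field is wide enough.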


    A garbage collection approach may work for removed entries. It might be
    a complicated state machine though to do in HW. It could look for
    invalid or entries with defunct ASIDs.



    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Feb 4 20:42:25 2026
    From Newsgroup: comp.arch


    EricP <ThatWouldBeTelling@thevillage.com> posted:

    EricP wrote:
    Robert Finch wrote:
    ??? I think a hash table has this characteristic. Multiple VA pages
can point to a single physical page using a hash table. Is it just a
    performance issue? ASID is part of the hash/compare.

    Be careful about ASID's - they aren't just part of the virtual address
    in MMU's like x86. There is a logic aspect in the address compare.
    If the PTE Global flag is set then ASID is NOT included in the VA compare, if the PTE Global flag is clear then ASID is included in the VA compare.

    If the G flag is set on OS kernel pages then the effect is to make the kernel virtual pages one single address space on all cores with no associated ASID, while process virtual spaces reside in their own ASID's.

    The above is easy to implement in a CAM because the Global flag is
    part of the PTE entry. Compare the virtual page number (VPN) and if
    there is a PTE match, then if the PTE.G flag is set take ASID as matched, otherwise check if ASID is matched.

    But if you just tack the ASID onto the VA to get a larger VA and
    use that as an index then you don't get the same logic because
    the PTE.G flag that tells you whether ASID matters for that PTE,
    is on the PTE that you are looking up.

    One could achieve the same effect as the CAM approach by picking off the range of addresses assigned to the OS and assigning them to ASID = 0,
and having the process range use the current non-zero ASID register.

    If we have a 64-bit VA with 4kB pages = 52-bit virtual page number
    and we added a 12-bit ASID then that gives a 76-bit virtual space
    with a 64-bit AS-VPN to look up. We reserve ASID=0 for the OS,
    1..4095 are for processes.

SPARC V8 had 13-bit ASIDs and found that this was way too short for
    the market SUN was targeting. SPARC V9 went with 16-bit ASIDs and
    I think this is as SHORT as an ASID can be for "reasonable" future
    proofing.

The MMU has a register MMU_OS_START that holds the OS unsigned lower
start address and a comparator. If the VA is >= MMU_OS_START then
    it uses an ASID = 0, otherwise it comes from the MMU_ASID register.

    This 12-bit value is concatenated with the 52-bit VPN to give a
    64-bit combined ASID-VPN index value to lookup in the TLB.

The OS software that assigns ASIDs also skips assigning 0 to any process.
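A sketch of that register-based alternative to a per-PTE Global bit: VAs at or above MMU_OS_START translate under ASID 0 (one shared kernel space), all lower VAs under the current process ASID. The constants here are illustrative only:

```python
MMU_OS_START = 0xFFFF_8000_0000_0000   # assumed kernel-space base address
PAGE_SHIFT = 12                         # 4 kB pages -> 52-bit VPN

def tlb_index(va, mmu_asid):
    # Pick ASID 0 for OS-range addresses, else the MMU_ASID register value.
    asid = 0 if va >= MMU_OS_START else mmu_asid
    vpn = va >> PAGE_SHIFT              # 52-bit virtual page number
    return (asid << 52) | vpn           # 64-bit combined ASID-VPN key

# Two processes produce the same key for a kernel address...
assert tlb_index(0xFFFF_8000_0000_1000, 7) == tlb_index(0xFFFF_8000_0000_1000, 9)
# ...but different keys for the same user address.
assert tlb_index(0x0000_0000_0040_0000, 7) != tlb_index(0x0000_0000_0040_0000, 9)
```

This keeps the indexed-lookup structure (no per-entry Global flag needed in the compare) while still sharing kernel translations across all address spaces.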



    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Feb 4 20:58:27 2026
    From Newsgroup: comp.arch


    kegs@provalid.com (Kent Dickey) posted:

    In article <10lji2u$2haq2$1@dont-email.me>,
    Robert Finch <robfi680@gmail.com> wrote:
    On 2026-01-30 4:44 p.m., BGB wrote:
    On 1/30/2026 3:03 PM, Robert Finch wrote:

    [ Proposal is a hashed page table, with 2x the entries to cover the entire physical address space, with each hash entry having 8 pages which need to
    be searched before moving to the next entry].

    It isn't obvious (at least to me) how what you are describing is much
    different from a SW managed TLB.

    It is not a lot different. On a translation miss the table is walked
    with HW though instead of SW walking main memory. There are enough
    entries in the table to cover the physical address range. Only page
faults are handled in SW which should be rare. A page fault is generated after 30 steps in the walk though. That is a miss on 240 possible locations. 240-way associativity is not enough?
The page fault handler will likely just report 'out-of-memory' then quit
    the app.

    I'm going to guess your physical address space is just 32-bits, otherwise this table is huge. I think you said you have 64KB pages (but then you said 256KB pages--which is just too big, so I'll ignore that), which for a
    44-bit physical address space would mean 2^28 entries, with each entry
    being 128-bits, for a 4GB page table. But for 32-bit, it's just 2^16
    entries of 64-bits each.

    A page table which doesn't scale is problematic. Especially if it needs
    OS support. Who wants to invest effort in a dead-end idea?

    Your hashed page table has issues which I think you should address:

    1) Supporting multiple page sizes is a pain. If you have 2MB pages and
    64KB pages, there are 32 64KB pages for each 2MB page. So do you
    lazily load the hashtable only on misses, or load all 32 aliases
    at once?

    I, for one, want to hear more about how he integrates multiple page
    sizes into an IPT or the associated TLB.

    2) It's common to just do a single hash lookup, then trap to software. Doing
    probing is a big pain. The problems come from removing entries.
    If you page out a page, it leaves gaps in the table. When you do
    a hardware lookup, and find no valid pages at the first entry, this
    doesn't tell you anything! The obvious solution is ALL hardware
    lookups must do the maximum searching before declaring a hash-table
    miss--which you say is 30 steps, checking 240 entries. I'm not sure
    there's a better solution than that, but this is a pretty big
    penalty for a TLB miss.

    3) It's very complex to efficiently update the hash table. On inserting
    a new entry, where do you put it? On one hand, a "move-to-front"
    strategy seems like a good idea--newly added pages seem like they
    should get lookup priority. But how do you do this? It looks like
    each software handled miss has a TON of work to do. It seems like
    you need to re-hash each entry you consider moving, to make sure it
    stays on its proper hash chain. And keeping the OS page structures
    in sync looks like even more work.

    4) The global nature of the hash table means multiple threads probably need
    a lock to change it. So misses to software have to get a lock before
    making any hashtable changes. This is very expensive. This is
    especially bad when a process ends--all of its translations need to
    be removed, and so all page table misses are blocked while that
    happens. This is bad.

On the other hand, when an OS is provided with means to shrink the hierarchical page tables, it is rarely used.

    For example, cat can be compiled and linked on My 66000 into an object
    module of 10-12 pages--including the paging tables themselves (1 page).

    A simpler approach is a per-process hash table of variable size in memory, sized to at least 2X the number of pages being used.

    The other thing I want to hear is how HyperVisor or Host OS uses its IVPT
    to remap Guest OS IVP into the single physical address space. That is how
    are IVPTs virtualized ??

    It's indexed by a hash of ASID+VA, and does a single probe. It might be useful to do two hash tables: one for large pages, and one for small pages (with 2 lookups).
    On any miss, the hash entry to be updated is easy to find for software.
    But: there's effectively no associativity--but just increase the hash table's size if software detects that you are getting conflicts. Pick a good
    hash function to avoid stupid conflicts. I recommend you make the hash function an instruction, so software can calculate it easily (so if you reverse bits and do other stuff that may be hard for software to calculate, you make it easy).

    This gives you the benefits of a hardware table walker with less trouble.
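The per-process single-probe scheme reads as something like the sketch below (my interpretation; the helper names, sizes, and the example hash are all assumptions, and a real design would use the proposed hash instruction rather than Python):

```python
PAGE_SHIFT = 16

def probe(table, asid, va):
    """Single hardware probe; returns PFN or None (software handles misses)."""
    vpn = va >> PAGE_SHIFT
    slot = table[(asid ^ vpn) % len(table)]
    if slot is not None and slot[:2] == (asid, vpn):
        return slot[2]
    return None

def sw_refill(table, asid, va, pfn):
    """Software miss handler: exactly one easy-to-find slot to update
    (a conflicting occupant is simply evicted; if conflicts become
    frequent, software grows and rehashes the table)."""
    vpn = va >> PAGE_SHIFT
    table[(asid ^ vpn) % len(table)] = (asid, vpn, pfn)

table = [None] * 64               # 2x headroom for a 32-page working set
sw_refill(table, 3, 0x40000, 0x9)
assert probe(table, 3, 0x40000) == 0x9
assert probe(table, 3, 0x50000) is None   # miss -> trap to sw_refill
```

The key trade: no probing chains, no remove-time gap problem, and the miss handler always knows which slot to write, at the cost of resolving conflicts entirely by table sizing.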

    An advantage of hashed page tables is the TLB itself has no "intermediate" entries. This lets the primary TLB hold more useful translations. This
    is not much of an issue if the primary TLB is >= 256 entries, but it's
    really noticeable if the TLB is <= 32 entries. You need to lean in to hashed page tables advantages.

    Question: If one has a 32-E to 64E Fully Associative L1 TLB backed up by
    a 6-cycle 1024-E 4-way L2 TLB--does the L1+L2 constitute its "primary"
    TLB (or not with explanation) ?!?

    Kent
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Feb 4 21:08:28 2026
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2026-02-01 10:28 a.m., Kent Dickey wrote:
    In article <10lji2u$2haq2$1@dont-email.me>,
    Robert Finch <robfi680@gmail.com> wrote:
    On 2026-01-30 4:44 p.m., BGB wrote:
    On 1/30/2026 3:03 PM, Robert Finch wrote:

    [ Proposal is a hashed page table, with 2x the entries to cover the entire physical address space, with each hash entry having 8 pages which need to be searched before moving to the next entry].

    It isn't obvious (at least to me) how what you are describing is much
    different from a SW managed TLB.

    It is not a lot different. On a translation miss the table is walked
    with HW though instead of SW walking main memory. There are enough
    entries in the table to cover the physical address range. Only page
faults are handled in SW which should be rare. A page fault is generated
after 30 steps in the walk though. That is a miss on 240 possible
locations. 240-way associativity is not enough?
The page fault handler will likely just report 'out-of-memory' then quit
    the app.

    I'm going to guess your physical address space is just 32-bits, otherwise

    Yes, this is for a 29-bit DRAM address space. I was going to use 2^13 entries with 256kB (or possibly smaller) pages. It is only usable with a smaller address space that can fit into BRAM.

    this table is huge. I think you said you have 64KB pages (but then you said
    256KB pages--which is just too big, so I'll ignore that), which for a 44-bit physical address space would mean 2^28 entries, with each entry being 128-bits, for a 4GB page table. But for 32-bit, it's just 2^16 entries of 64-bits each.

    A page table which doesn't scale is problematic. Especially if it needs
    OS support. Who wants to invest effort in a dead-end idea?

    Your hashed page table has issues which I think you should address:

    1) Supporting multiple page sizes is a pain. If you have 2MB pages and
    64KB pages, there are 32 64KB pages for each 2MB page. So do you
    lazily load the hashtable only on misses, or load all 32 aliases
    at once?

Supporting multiple page sizes is a pain even for a hierarchical table. Multiple TLBs are used, so maybe multiple hash tables could be used. But
    I was not planning on supporting more than one page size.

    I support multiple page sizes, and at both the architectural level and
    in the TLB circuits I did not find it all that difficult. {Hint: I built
one of the TLBs for Ross HyperSparc chips.} There are several ways to do
    it in logic, none of it "hard". Most of it "a bit different".

    Defining Super pages is "just stopping" early in the tablewalk.
    Skipping intermediate levels is just like "stopping early" except
    for the stopping part.
    Skipping top layers is "child's play" at that point {Smaller VAS}.

    My 66000 has all of that; but in addition:
    My 66000 SuperPages use the unused PA bits as a limit function so a
    SuperPage of 1024 pages can be used to map 1-1024 actual page(s).


    2) It's common to just do a single hash lookup, then trap to software. Doing
    probing is a big pain. The problems come from removing entries.
    If you page out a page, it leaves gaps in the table. When you do
    a hardware lookup, and find no valid pages at the first entry, this
    doesn't tell you anything! The obvious solution is ALL hardware

    Thanks, I had forgotten the issue with removes. Inserts are easy without removes. I wonder if the removes could be done in bulk similar to a
    garbage collection. Just mark the entry as deleted rather than leaving a gap, then have a process that periodically rebuilds the table. For the
table size I am planning on, garbage collecting it may not be too bad.

    The table probes until it finds the entry, or an empty entry in a group where the lookup should be. It stops after a certain number of probes if there are no empty entries in the group.

    The lookup returns the group and the group empty status at which there
    is an empty entry if the lookup is not found. So, adding a new entry
    should be straightforward.

    lookups must do the maximum searching before declaring a hash-table
    miss--which you say is 30 steps, checking 240 entries. I'm not sure
    there's a better solution than that, but this is a pretty big
    penalty for a TLB miss.

30 steps (in HW) is about as fast as a divide operation. It only works
for BRAM; if it were DRAM it would be many clocks per lookup.

    Also: a far cry from a 6-cycle L2 TLB with 1024-E.


    3) It's very complex to efficiently update the hash table. On inserting
    a new entry, where do you put it? On one hand, a "move-to-front"
    strategy seems like a good idea--newly added pages seem like they
    should get lookup priority. But how do you do this? It looks like
    each software handled miss has a TON of work to do. It seems like

    I was not planning on anything that complex. The miss returns the group
    at which the entry should be located, so SW should know where to insert.
For testing, the test bench fakes inserting pages. The issues arise when
    the table gets full.

    you need to re-hash each entry you consider moving, to make sure it
    stays on its proper hash chain. And keeping the OS page structures
    in sync looks like even more work.

    4) The global nature of the hash table means multiple threads probably need
    a lock to change it. So misses to software have to get a lock before
    making any hashtable changes. This is very expensive. This is
    especially bad when a process ends--all of its translations need to
    be removed, and so all page table misses are blocked while that
    happens. This is bad.

    Definitely an issue. May have only the system (one) thread able to
    update it.


    A simpler approach is a per-process hash table of variable size in memory, sized to at least 2X the number of pages being used. It's indexed by a hash
    of ASID+VA, and does a single probe. It might be useful to do two hash tables: one for large pages, and one for small pages (with 2 lookups).
    On any miss, the hash entry to be updated is easy to find for software. But: there's effectively no associativity--but just increase the hash table's
    size if software detects that you are getting conflicts. Pick a good
    hash function to avoid stupid conflicts. I recommend you make the hash function an instruction, so software can calculate it easily (so if you reverse bits and do other stuff that may be hard for software to calculate, you make it easy).

    This gives you the benefits of a hardware table walker with less trouble.

    An advantage of hashed page tables is the TLB itself has no "intermediate" entries. This lets the primary TLB hold more useful translations. This
    is not much of an issue if the primary TLB is >= 256 entries, but it's really noticeable if the TLB is <= 32 entries. You need to lean in to hashed
    page tables advantages.

    Kent

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Feb 5 15:07:55 2026
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> writes:
    On 2026-02-01 12:49 p.m., EricP wrote:
    Simpler to have create process assign a new ASID by incrementing
    an unsigned up counter that starts at 1. When you hit its max then reset
    to 1 and clear the whole IVT for all non-zero ASID entries in one pass.


    A garbage collection approach may work for removed entries. It might be
    a complicated state machine though to do in HW. It could look for
    invalid or entries with defunct ASIDs.

    AFAIK the approach outlined by EricP has been used in commercial
    hardware. It's simple, and if there are enough ASIDs, cheap (because
    resets are rare). So better add more bits to the ASIDs than introduce complicated machinery for reclaiming ASIDs.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From kegs@kegs@provalid.com (Kent Dickey) to comp.arch on Fri Feb 6 05:23:38 2026
    From Newsgroup: comp.arch

    In article <1770238707-5857@newsgrouper.org>,
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    kegs@provalid.com (Kent Dickey) posted:

An advantage of hashed page tables is the TLB itself has no "intermediate" entries. This lets the primary TLB hold more useful translations. This
    is not much of an issue if the primary TLB is >= 256 entries, but it's
    really noticeable if the TLB is <= 32 entries. You need to lean in to hashed
    page tables advantages.

    Question: If one has a 32-E to 64E Fully Associative L1 TLB backed up by
    a 6-cycle 1024-E 4-way L2 TLB--does the L1+L2 constitute its "primary"
    TLB (or not with explanation) ?!?

    I think the 32-entry fully associative is the primary TLB. The 1024 entry
    is a secondary TLB. Hits in the primary TLB cause no stalls at all.
    Misses in the primary probably lead to stalls (but it is possible the
    6 cycle latency can be hidden sometimes), so it costs performance.

    If the primary TLB (the one with no stalls) is only 4 entries, then
    we can easily calculate a pretty substantial stall penalty for most
    programs since it will pay the 6 cycle stall a lot (and that's for a L2
    TLB hit). So we make the L1 TLB as big as possible, but it cannot be made
    too large or its too slow, so it ends up between 32- and 128-entries
    (256 tops).

    I'm just pointing out the intermediate hierarchical entries (the page
table entries which are just page table pointers) need to be cached in
    the TLB to make future lookups faster. And putting them there reduces
    the size of the TLB for "useful" entries. You can try to avoid this
    penalty by only putting page table pointer entries in the L2 TLB, but then you're just making primary TLB misses slower, especially since the L2 TLB lookup cannot scan all entries like the primary can--you have to probe
    each one separately.

    If you want to translate the address 0x1234_56789000 using 4KB pages, you
    can do a SINGLE primary associative TLB lookup, and look for:

    1) Page entry for 0x1234_56789000
    2) Page-table pointer for 0x1234_566xxxxx
    3) Page-table pointer for 0x1234_4xxxxxxx
    4) Page-table pointer for 0x120x_xxxxxxxx

(you may need to do a second lookup to get page-table entries, since it's expensive to deal with multiple hits in one clock).

    With the page-table-pointer for 0x1234_566xxxxx, you then do a single
    data cache lookup to get the page entry (say 3 more clocks). So putting
    these page-table pointers in the L1 TLB has a big benefit to the page
    walk times.
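That single associative lookup can be modeled as a best-match search over cached entries at every level (a toy model only; the level shifts assume a 4-level, 4 kB, x86-like layout, which is my assumption, not Kent's):

```python
LEVEL_SHIFTS = [12, 21, 30, 39]   # leaf page, then PT, PD, PDP pointers

def best_match(tlb, va):
    """Return (level, entry) for the deepest cached entry covering va,
    preferring the leaf translation, then the closest page-table pointer."""
    for level, shift in enumerate(LEVEL_SHIFTS):
        key = (level, va >> shift)
        if key in tlb:
            return level, tlb[key]
    return None                    # nothing cached: full walk from the root

tlb = {}
tlb[(1, 0x1234_5678_9000 >> 21)] = "pt-pointer"    # cached level-1 pointer
level, entry = best_match(tlb, 0x1234_5678_9000)
assert level == 1 and entry == "pt-pointer"        # one memory access left
tlb[(0, 0x1234_5678_9000 >> 12)] = "leaf-pte"
assert best_match(tlb, 0x1234_5678_9000) == (0, "leaf-pte")
```

A fully associative L1 TLB can evaluate all four candidate matches in one cycle; a set-associative L2 TLB has to probe each level separately, which is the 9-clock back-to-back sequence described above.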

    But note that for the L2 TLB, looking each of those up may be a separate lookup. And you'd want to do a lookup in the reverse order from the hierarchical table walk--you want the page if it exists, then the closest page-table, etc. At 6-cycles latency, it seems like you would do the
    lookups back-to-back and not wait for the results, so you would look up
    all 4 in the L2 TLB in 9 clocks. And putting page-table-pointers in the
    L2 TLB is tricky, the indexing function needs thought (it's similar to the problem for handling variable page sizes with a hash table, but we KNOW
    which size we're looking for at each step at least).

    And I'm just pointing out putting these page table pointer entries in
    the L1 TLB makes the L1 TLB feel "smaller". For even the smallest
    process has three separate translation areas: code, data, and stack, and
    these are all on different pages, and they will all be touched
    regularly. So there's intermediate page table pointers crowding other
    useful translations out of the L1 TLB--this tiny process of 3 pages
    total could use 4+2+2 L1 TLB entries (I'm assuming the regions are at
    least 1GB apart, and using 4KB pages, which are reasonable assumptions).

    Accounting for the overhead, it seems like almost any process will waste
    at least 7 L1 TLB entries to these intermediate page table pointer
    entries, so your 32-entry L1 TLB will generally hold no more than 25
    useful pages.
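Kent's accounting can be checked mechanically: three regions (code, data, stack) at least 1 GB apart with 4 kB pages and a 4-level table. The placements below are hypothetical, chosen only to satisfy the spacing assumption:

```python
LEVEL_SHIFTS = [12, 21, 30, 39]   # leaf page, then 3 pointer levels
regions = [0x0000_0040_0000,      # code   (hypothetical placements)
           0x0000_8000_0000,      # data, ~2 GB away
           0x7FFF_F000_0000]      # stack, far above both

# Distinct (level, index) entries the L1 TLB would end up holding.
entries = {(lvl, addr >> s) for addr in regions
                            for lvl, s in enumerate(LEVEL_SHIFTS)}
leaf = sum(1 for lvl, _ in entries if lvl == 0)
pointers = len(entries) - leaf
print(leaf, pointers)             # useful leaf pages vs. pointer overhead
assert leaf == 3 and pointers >= 7   # matches the "at least 7 wasted" claim
```

With these placements the 3 useful pages drag 8 intermediate-pointer entries into the TLB, in line with the estimate that a 32-entry L1 TLB holds no more than about 25 useful translations.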

    Kent
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Fri Feb 6 09:29:08 2026
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    Robert Finch <robfi680@gmail.com> writes:
    On 2026-02-01 12:49 p.m., EricP wrote:
    Simpler to have create process assign a new ASID by incrementing
an unsigned up counter that starts at 1. When you hit its max then reset to 1 and clear the whole IVT for all non-zero ASID entries in one pass.


    A garbage collection approach may work for removed entries. It might be
    a complicated state machine though to do in HW. It could look for
    invalid or entries with defunct ASIDs.

    AFAIK the approach outlined by EricP has been used in commercial
    hardware. It's simple, and if there are enough ASIDs, cheap (because
    resets are rare). So better add more bits to the ASIDs than introduce complicated machinery for reclaiming ASIDs.

    - anton

    The method I described before was a bit brute force, as it looks like
    it has to stall the whole system while it erases the old ASIDs.
    It is inefficient as it erases all ASIDs including ones recently assigned.

    One issue with invalidating ASIDs is that the L1-TLB may be fully
    associative, the L2-TLB often are set associative.
While a full assoc TLB can invalidate multiple matching entries in
    1 clock, the set assoc TLB requires a scan doing a read-compare-write
    operation on all the ways in each set, taking 2 or more clocks per set.
    So an L2-TLB with 512 sets could take 1024 clocks or more to complete
    an ASID shoot down.

    This would also cause all running processes to have a flurry of TLB misses
    as they reload the entries just purged.

I would want something smoother, continuous, that usually didn't need to
    stall the whole system, and only recycles the oldest ASIDs.

    Something like, split the ASIDs into 4 groups using the high 2 msb.
    As the ASID counter increments from group N to group N+1, it initiates an invalidate of group N+2, such that it retains the recent ASIDs in group N
    and N-1. And this is circular so when it starts assigning in G3 it
    initiates the invalidate of G0. The invalidate is also asynchronous,
    so the system can continue to run and assign ASIDs while it invalidates
    the next group ahead (there are some interlocks waiting for ACK's).

    e.g. Say the ASID counter is 7 bits so 4 groups of 32 IDs, called G0..G3.
The current assignment ASID counter = 63 in G1.
When it increments to 64 in G2, it begins invalidating all group G3 ids
in the range 96..127. Any processes currently using ASIDs in the G3 range
    have their ASID invalidated, and when they next run will assign a new id
in the current assignment group. All processes in groups G0 and G1 are unaffected by this ASID recycling. The core broadcasts an IPI (inter-processor interrupt) to all other cores to shoot down all group G3 ASIDs.
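A rough simulation of that circular group scheme: a 7-bit ASID counter split into 4 groups of 32 by its top two bits; crossing from group N into N+1 invalidates group N+2, preserving the current and previous groups. (The synchronous invalidate here stands in for the asynchronous hardware process described above.)

```python
GROUP_SIZE = 32                    # 4 groups x 32 = 7-bit ASID space

live = set()                       # ASIDs currently valid in TLBs
counter = 0

def allocate():
    global counter
    old_group = counter // GROUP_SIZE
    counter = (counter + 1) % (4 * GROUP_SIZE)
    new_group = counter // GROUP_SIZE
    if new_group != old_group:
        # Entering group N+1 invalidates group N+2 (= new_group + 1 here),
        # keeping the ASIDs in groups N and N-1 intact.
        victim = (new_group + 1) % 4
        live.difference_update(range(victim * GROUP_SIZE,
                                     (victim + 1) * GROUP_SIZE))
    live.add(counter)
    return counter

for _ in range(100):               # wander through several group crossings
    allocate()
# By the time allocation reaches group G3 (IDs 96..127), group G0 (0..31)
# has been shot down, while the two most recent groups are untouched.
assert not any(a < 32 for a in live)
```

Unlike the wrap-and-flush counter, only the oldest quarter of the ASID space is ever recycled at once, so recently assigned processes keep their TLB entries.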

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Fri Feb 6 11:59:27 2026
    From Newsgroup: comp.arch

    MitchAlsup wrote:
    EricP <ThatWouldBeTelling@thevillage.com> posted:

    EricP wrote:
    Robert Finch wrote:
    ??? I think a hash table has this characteristic. Multiple VA pages
can point to a single physical page using a hash table. Is it just a performance issue? ASID is part of the hash/compare.
    Be careful about ASID's - they aren't just part of the virtual address
    in MMU's like x86. There is a logic aspect in the address compare.
    If the PTE Global flag is set then ASID is NOT included in the VA compare, >>> if the PTE Global flag is clear then ASID is included in the VA compare. >>>
    If the G flag is set on OS kernel pages then the effect is to make the
    kernel virtual pages one single address space on all cores with no
    associated ASID, while process virtual spaces reside in their own ASID's. >>>
    The above is easy to implement in a CAM because the Global flag is
    part of the PTE entry. Compare the virtual page number (VPN) and if
    there is a PTE match, then if the PTE.G flag is set take ASID as matched, >>> otherwise check if ASID is matched.

    But if you just tack the ASID onto the VA to get a larger VA and
    use that as an index then you don't get the same logic because
    the PTE.G flag that tells you whether ASID matters for that PTE,
    is on the PTE that you are looking up.
    One could achieve the same effect as the CAM approach by picking off the
    range of addresses assigned to the OS and assigning them to ASID = 0,
    and having the process range use the current non-zero ASID register.

    If we have a 64-bit VA with 4kB pages = 52-bit virtual page number
    and we added a 12-bit ASID then that gives a 76-bit virtual space
    with a 64-bit AS-VPN to look up. We reserve ASID=0 for the OS,
    1..4095 are for processes.
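    A sketch of that lookup key (assuming the 52-bit VPN and 12-bit ASID
    above; names are illustrative):

```python
# Build the 64-bit AS-VPN lookup key from a 64-bit VA and a 12-bit ASID,
# assuming 4 kB pages (12 offset bits) so the VPN is 52 bits.
PAGE_SHIFT = 12
VPN_BITS = 52
OS_ASID = 0            # ASID 0 reserved for the OS; 1..4095 for processes

def as_vpn(va, asid):
    assert 0 <= asid < (1 << 12)
    vpn = (va >> PAGE_SHIFT) & ((1 << VPN_BITS) - 1)
    return (asid << VPN_BITS) | vpn
```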

    SPARC V8 had 13-bit ASIDs and found that this was way too short for
    the market SUN was targeting. SPARC V9 went with 16-bit ASIDs and
    I think this is as SHORT as an ASID can be for "reasonable" future
    proofing.

    It also depends on whether ASIDs are assigned globally for a whole system
    or locally for each core. Local assignment for each core needs less bits.
    For globally assigned ASID 16 bits sounds too small for today -
    they recycle too often.

    When ASIDs were first introduced on x86 I remember a certain amount
    of concern for the effects they would have on the fully assoc TLB.
    The 32-bit x86 has a 20-bit Virtual Page Number so adding a 12-bit ASID
    would increase the TLB size and power usage by 60%.

    Back then I spent some time looking at how to do local 4-bit ASIDs per-core. That is, each core has its own 4-bit ASID counter and each process
    remembers which ASID it used last on each core.
    Each core's TLB retains entries for up to the last 16 processes to run
    threads on that core, and each core recycles its ASID independently.
    The 4-bit size was chosen based on the assumption that the odds are small
    of a TLB containing useful entries beyond the last 16 processes to run.

    This eliminates many of the global ASID shoot-downs, but 4 bits is too small as
    it would recycle too quickly causing too frequent local ASID invalidates.

    Combining all the ideas: a 16+ bit ASID, locally assigned per core,
    with a batch TLB ASID invalidate mechanism using a bit mask for matching
    ASIDs, so I can invalidate all ids in just one of the groups I mentioned elsewhere: GGxx_xxxx_xxxx_xxxx, where GG is the 2-bit group number.
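    A sketch of that masked match (assuming a 16-bit ASID whose top 2 bits
    are the group, as above; names are illustrative):

```python
# Batch-invalidate match: an entry's ASID matches the invalidate request
# if it agrees with the request value under the request mask. The mask
# 0xC000 selects only the 2-bit group field GG of a 16-bit ASID.
GROUP_MASK = 0xC000

def group_match(entry_asid, group):
    return (entry_asid & GROUP_MASK) == (group << 14)

def invalidate_group(tlb_asids, group):
    # Returns the entries that survive a group invalidate.
    return [a for a in tlb_asids if not group_match(a, group)]
```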


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Feb 6 18:01:15 2026
    From Newsgroup: comp.arch


    kegs@provalid.com (Kent Dickey) posted:

    In article <1770238707-5857@newsgrouper.org>,
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    kegs@provalid.com (Kent Dickey) posted:

    An advantage of hashed page tables is the TLB itself has no "intermediate"
    entries. This lets the primary TLB hold more useful translations. This
    is not much of an issue if the primary TLB is >= 256 entries, but it's
    really noticeable if the TLB is <= 32 entries. You need to lean in to
    hashed page tables' advantages.

    Question: If one has a 32-E to 64E Fully Associative L1 TLB backed up by
    a 6-cycle 1024-E 4-way L2 TLB--does the L1+L2 constitute its "primary"
    TLB (or not with explanation) ?!?

    I think the 32-entry fully associative is the primary TLB. The 1024 entry
    is a secondary TLB. Hits in the primary TLB cause no stalls at all.
    Misses in the primary probably lead to stalls (but it is possible the
    6 cycle latency can be hidden sometimes), so it costs performance.

    If the primary TLB (the one with no stalls) is only 4 entries, then
    we can easily calculate a pretty substantial stall penalty for most
    programs since it will pay the 6 cycle stall a lot (and that's for a L2
    TLB hit). So we make the L1 TLB as big as possible, but it cannot be made too large or it's too slow, so it ends up between 32 and 128 entries
    (256 tops).

    I'm just pointing out the intermediate hierarchical entries (the page
    table entries which are just page table pointers) need to be cached in
    the TLB to make future lookups faster.

    Cached yes, in TLB no--generally cached in TableWalk Accelerator store.

    And putting them there reduces
    the size of the TLB for "useful" entries.

    Which is why you don't put 'em there.

    You can try to avoid this
    penalty by only putting page table pointer entries in the L2 TLB, but then you're just making primary TLB misses slower, especially since the L2 TLB lookup cannot scan all entries like the primary can--you have to probe
    each one separately.

    If you want to translate the address 0x1234_56789000 using 4KB pages, you
    can do a SINGLE primary associative TLB lookup, and look for:

    1) Page entry for 0x1234_56789000
    2) Page-table pointer for 0x1234_566xxxxx
    3) Page-table pointer for 0x1234_4xxxxxxx
    4) Page-table pointer for 0x120x_xxxxxxxx

    (you may need to do a second lookup to get page-table entries, since it's expensive to deal with multiple hits in one clock).
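    The four candidate tags above can be sketched as follows (assuming the
    x86-64-style 4-level layout with 4 KB pages and 9 index bits per level;
    names are illustrative):

```python
# For a VA, compute the tag that would match at each level of a single
# associative lookup: the leaf page (4 KB) plus the page-table pointers
# covering the enclosing 2 MB, 1 GB and 512 GB regions.
LEVEL_SHIFTS = [12, 21, 30, 39]   # 4 KB, 2 MB, 1 GB, 512 GB

def lookup_tags(va):
    return [va >> s for s in LEVEL_SHIFTS]

# For va = 0x1234_56789000 the covered regions are
# 0x1234_56789xxx, 0x1234_566xxxxx, 0x1234_4xxxxxxx, 0x120x_xxxxxxxx.
```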

    Another reason to put them other than in TLB. One can perform all 4 accesses simultaneously, and start the tablewalk as far down the tree as possible.

    With the page-table-pointer for 0x1234_566xxxxx, you then do a single
    data cache lookup to get the page entry (say 3 more clocks). So putting these page-table pointers in the L1 TLB has a big benefit to the page
    walk times.

    A delicate balance, especially if L2 is less than 10 cycles.
    ------------------------
    Accounting for the overhead, it seems like almost any process will waste
    at least 7 L1 TLB entries to these intermediate page table pointer
    entries, so your 32-entry L1 TLB will generally hold no more than 25
    useful pages.

    So, you don't put 'em in TLB; you put 'em in TW-Accelerator Store.

    Kent
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Fri Jan 30 11:51:15 2026
    From Newsgroup: comp.arch

    On 2026-01-30 6:16 a.m., Dan Cross wrote:
    In article <10lfuka$1d2c4$1@dont-email.me>,
    Robert Finch <robfi680@gmail.com> wrote:
    Have inverted page tables fallen out of favor? And how much OS support
    is there for them?

    IVTs are a terrible idea. Consider the challenge of using them
    to share memory between virtual address spaces in a
    multiprocessing system, efficiently.

    - Dan C.

    I may have used misleading terminology, the IVT I refer to is a hash
    table based one. I tend to think of them as the same thing. I do not
    think anybody would use a plain IVT.

    <response to earlier comments>
    Is the entire VAS covered by the hierarchical page table system?

    With the entire PAS covered by the page table in BRAM it can be walked
    in hardware very quickly, one cycle per step as opposed to walking the
    page table in DRAM which could be quite slow.

    Process switch is handled by including an ASID in the mapping as for a TLB.

    For the IVT implementation the table is twice the size needed to cover
    the PAS to allow for shared memory pages.


    The table just stores mappings VPN -> PPN, so I am not quite following
    the challenge to using them for shared memory. Multiple VPNs mapping to
    the same PPN are possible. Is it the freeing up of physical pages in SW
    that causes an issue?

    I seem to recall at least one fellow advocating the limited use of
    shared memory, using replication instead of shared libraries for instance.

    A hierarchical page table is a better solution, but I was looking for something lower cost. My monster 2-level TLB is currently about 9000
    LUTs (I have been working to reduce this). The IVT is about 900 LUTs.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Jan 30 20:35:14 2026
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2026-01-30 6:16 a.m., Dan Cross wrote:
    In article <10lfuka$1d2c4$1@dont-email.me>,
    Robert Finch <robfi680@gmail.com> wrote:
    Have inverted page tables fallen out of favor? And how much OS support
    is there for them?

    IVTs are a terrible idea. Consider the challenge of using them
    to share memory between virtual address spaces in a
    multiprocessing system, efficiently.

    - Dan C.

    I may have used misleading terminology, the IVT I refer to is a hash
    table based one. I tend to think of them as the same thing. I do not
    think anybody would use a plain IVT.

    <response to earlier comments>
    Is the entire VAS covered by the hierarchical page table system?

    With the entire PAS covered by the page table in BRAM it can be walked
    in hardware very quickly, one cycle per step as opposed to walking the
    page table in DRAM which could be quite slow.

    Process switch is handled by including an ASID in the mapping as for a TLB.

    For the IVT implementation the table is twice the size needed to cover
    the PAS to allow for shared memory pages.


    The table just stores mappings VPN -> PPN, so I am not quite following
    the challenge to using them for shared memory. Multiple VPNs mapping to
    the same PPN are possible. Is it the freeing up of physical pages in SW
    that causes an issue?

    Consider mmap(file) in multiple different processes at different VAs.
    So, now one has multiple VA pages pointing at one Physical page.

    I seem to recall at least one fellow advocating the limited use of
    shared memory, using replication instead of shared libraries for instance.

    A hierarchical page table is a better solution, but I was looking for something lower cost. My monster 2-level TLB is currently about 9000
    LUTs (I have been working to reduce this). The IVT is about 900 LUTs.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Feb 1 16:41:34 2026
    From Newsgroup: comp.arch

    On 2026-02-01 10:28 a.m., Kent Dickey wrote:
    In article <10lji2u$2haq2$1@dont-email.me>,
    Robert Finch <robfi680@gmail.com> wrote:
    On 2026-01-30 4:44 p.m., BGB wrote:
    On 1/30/2026 3:03 PM, Robert Finch wrote:

    [ Proposal is a hashed page table, with 2x the entries to cover the entire physical address space, with each hash entry having 8 pages which need to
    be searched before moving to the next entry].

    It isn't obvious (at least to me) how what you are describing is much
    different from a SW managed TLB.

    It is not a lot different. On a translation miss the table is walked
    with HW though instead of SW walking main memory. There are enough
    entries in the table to cover the physical address range. Only page
    faults are handled in SW which should be rare. A page fault is generated
    after 30 steps in the walk though. That is a miss on 240 possible
    locations. 240-way associativity is not enough?
    The page fault handler will likely just report 'out-of-memory' then quit
    the app.

    I'm going to guess your physical address space is just 32-bits, otherwise

    Yes, this is for a 29-bit DRAM address space. I was going to use 2^13
    entries with 256kB (or possibly smaller) pages. It is only usable with a smaller address space that can fit into BRAM.

    this table is huge. I think you said you have 64KB pages (but then you said 256KB pages--which is just too big, so I'll ignore that), which for a
    44-bit physical address space would mean 2^28 entries, with each entry
    being 128-bits, for a 4GB page table. But for 32-bit, it's just 2^16
    entries of 64-bits each.
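    The sizing arithmetic above can be checked directly (a sketch; entry
    sizes as stated in the paragraph):

```python
# Hashed page table sized to cover the whole physical address space:
# one entry per physical page frame.
def table_bytes(pa_bits, page_bits, entry_bytes):
    entries = 1 << (pa_bits - page_bits)
    return entries * entry_bytes

# 44-bit PA, 64 KB pages, 128-bit entries -> 2^28 entries, 4 GB of table.
big = table_bytes(44, 16, 16)
# 32-bit PA, 64 KB pages, 64-bit entries -> 2^16 entries, 512 KB of table.
small = table_bytes(32, 16, 8)
```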

    A page table which doesn't scale is problematic. Especially if it needs
    OS support. Who wants to invest effort in a dead-end idea?

    Your hashed page table has issues which I think you should address:

    1) Supporting multiple page sizes is a pain. If you have 2MB pages and
    64KB pages, there are 32 64KB pages for each 2MB page. So do you
    lazily load the hashtable only on misses, or load all 32 aliases
    at once?

    Supporting multiple page sizes is a pain even for a hierarchical table. Multiple TLBs are used, so maybe multiple hash tables could be used. But
    I was not planning on supporting more than one page size.


    2) It's common to just do a single hash lookup, then trap to software. Doing
    probing is a big pain. The problems come from removing entries.
    If you page out a page, it leaves gaps in the table. When you do
    a hardware lookup, and find no valid pages at the first entry, this
    doesn't tell you anything! The obvious solution is ALL hardware

    Thanks, I had forgotten the issue with removes. Inserts are easy without
    removes. I wonder if the removes could be done in bulk, similar to a
    garbage collection. Just mark the entry as deleted rather than leaving a
    gap, then have a process that periodically rebuilds the table. For the
    table size I am planning on using, garbage collection may not be too bad.

    The table probes until it finds the entry, or an empty entry in a group
    where the lookup should be. It stops after a certain number of probes if
    there are no empty entries in the group.

    If the lookup is not found, it returns the group and that group's
    empty-entry status (where a new entry could go). So, adding a new entry
    should be straightforward.
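    A sketch of that probe/tombstone scheme (all names illustrative; group
    size and probe limit chosen to match the numbers in this thread):

```python
# Hashed page table probe with tombstones: a lookup walks groups until it
# finds the entry, hits an empty slot (a definite miss), or exhausts the
# probe limit. Deleted entries are tombstoned so probes can continue past
# them; a periodic rebuild reclaims tombstones.
EMPTY, DELETED = object(), object()
GROUP_SIZE = 8
MAX_PROBES = 30

def lookup(table, key, hash_fn):
    g = hash_fn(key)
    n_groups = len(table) // GROUP_SIZE
    for _ in range(MAX_PROBES):
        base = (g % n_groups) * GROUP_SIZE
        for i in range(GROUP_SIZE):
            slot = table[base + i]
            if slot is EMPTY:
                return None, base       # miss; an insert could go here
            if slot is not DELETED and slot[0] == key:
                return slot[1], base    # hit: return the PPN
        g += 1                          # probe the next group
    return None, None                   # miss after maximum probing
```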

    lookups must do the maximum searching before declaring a hash-table
    miss--which you say is 30 steps, checking 240 entries. I'm not sure
    there's a better solution than that, but this is a pretty big
    penalty for a TLB miss.

    30 steps (in HW) is about as fast as a divide operation. It only works
    for BRAM; if it was DRAM it would be many clocks per lookup.


    3) It's very complex to efficiently update the hash table. On inserting
    a new entry, where do you put it? On one hand, a "move-to-front"
    strategy seems like a good idea--newly added pages seem like they
    should get lookup priority. But how do you do this? It looks like
    each software handled miss has a TON of work to do. It seems like

    I was not planning on anything that complex. The miss returns the group
    at which the entry should be located, so SW should know where to insert.
    For testing, the test bench fakes inserting pages. The issues arise when
    the table gets full.

    you need to re-hash each entry you consider moving, to make sure it
    stays on its proper hash chain. And keeping the OS page structures
    in sync looks like even more work.

    4) The global nature of the hash table means multiple threads probably need
    a lock to change it. So misses to software have to get a lock before
    making any hashtable changes. This is very expensive. This is
    especially bad when a process ends--all of its translations need to
    be removed, and so all page table misses are blocked while that
    happens. This is bad.

    Definitely an issue. Maybe have only the system (one) thread able to
    update it.


    A simpler approach is a per-process hash table of variable size in memory, sized to at least 2X the number of pages being used. It's indexed by a hash of ASID+VA, and does a single probe. It might be useful to do two hash tables: one for large pages, and one for small pages (with 2 lookups).
    On any miss, the hash entry to be updated is easy to find for software.
    But: there's effectively no associativity--but just increase the hash table's size if software detects that you are getting conflicts. Pick a good
    hash function to avoid stupid conflicts. I recommend you make the hash function an instruction, so software can calculate it easily (so if you reverse bits and do other stuff that may be hard for software to calculate, you make it easy).

    This gives you the benefits of a hardware table walker with less trouble.
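    A sketch of that per-process, single-probe scheme (the hash mixing
    function here is purely illustrative; the point of exposing the hash as
    an instruction is that software can compute exactly what hardware does):

```python
# Per-process hashed page table with a single probe: the table is sized to
# >= 2x the process's resident pages; index = hash(ASID, VPN) mod size.
# A miss traps to software, which knows exactly which slot to update.
MASK64 = (1 << 64) - 1

def hash_fn(asid, vpn):
    # Illustrative mixing function (xor-shift plus a multiply).
    x = ((asid << 52) ^ vpn) & MASK64
    x ^= x >> 17
    return (x * 0x9E3779B97F4A7C15) & MASK64

def probe(table, asid, vpn):
    slot = table[hash_fn(asid, vpn) % len(table)]
    if slot is not None and slot[0] == (asid, vpn):
        return slot[1]        # hit: the PPN
    return None               # single-probe miss: trap to software

def insert(table, asid, vpn, ppn):
    table[hash_fn(asid, vpn) % len(table)] = ((asid, vpn), ppn)
```

    On a conflict, software's recourse is the one Kent suggests: grow the
    table and rehash, rather than probe further in hardware.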

    An advantage of hashed page tables is the TLB itself has no "intermediate" entries. This lets the primary TLB hold more useful translations. This
    is not much of an issue if the primary TLB is >= 256 entries, but it's
    really noticeable if the TLB is <= 32 entries. You need to lean in to hashed page tables advantages.

    Kent

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Feb 8 19:28:51 2026
    From Newsgroup: comp.arch


    EricP <ThatWouldBeTelling@thevillage.com> posted:

    MitchAlsup wrote:
    EricP <ThatWouldBeTelling@thevillage.com> posted:

    EricP wrote:
    Robert Finch wrote:
    ??? I think a hash table has this characteristic. Multiple VA pages
    can point to a single physical page using a hash table. Is it just a
    performance issue? ASID is part of the hash/compare.
    Be careful about ASID's - they aren't just part of the virtual address
    in MMU's like x86. There is a logic aspect in the address compare.
    If the PTE Global flag is set then ASID is NOT included in the VA compare;
    if the PTE Global flag is clear then ASID is included in the VA compare.

    If the G flag is set on OS kernel pages then the effect is to make the
    kernel virtual pages one single address space on all cores with no
    associated ASID, while process virtual spaces reside in their own ASID's.

    The above is easy to implement in a CAM because the Global flag is
    part of the PTE entry. Compare the virtual page number (VPN) and if
    there is a PTE match, then if the PTE.G flag is set take ASID as matched,
    otherwise check if ASID is matched.

    But if you just tack the ASID onto the VA to get a larger VA and
    use that as an index then you don't get the same logic because
    the PTE.G flag that tells you whether ASID matters for that PTE,
    is on the PTE that you are looking up.
    One could achieve the same effect as the CAM approach by picking off the
    range of addresses assigned to the OS and assigning them to ASID = 0,
    and having the process range use the current non-zero ASID register.

    If we have a 64-bit VA with 4kB pages = 52-bit virtual page number
    and we added a 12-bit ASID then that gives a 76-bit virtual space
    with a 64-bit AS-VPN to look up. We reserve ASID=0 for the OS,
    1..4095 are for processes.

    SPARC V8 had 13-bit ASIDs and found that this was way too short for
    the market SUN was targeting. SPARC V9 went with 16-bit ASIDs and
    I think this is as SHORT as an ASID can be for "reasonable" future proofing.

    It also depends on whether ASIDs are assigned globally for a whole system
    or locally for each core. Local assignment for each core needs less bits.
    For globally assigned ASID 16 bits sounds too small for today -
    they recycle too often.

    When ASIDs were first introduced on x86 I remember a certain amount
    of concern for the effects they would have on the fully assoc TLB.
    The 32-bit x86 has a 20-bit Virtual Page Number so adding a 12-bit ASID
    would increase the TLB size and power usage by 60%.

    TLB CAM size did grow by 60%. TLB data store by next to nothing.

    Power grew by a trifling amount, because the ASID is constant between context switches.
    That is: the wires into TLB corresponding to ASID change once and remain constant until changed again, so no dynamic power--and the CAM cells can
    remain matched/unMatched for the duration. So ASID only consumes TLB power
    on context switch and TLB PTE install. Under normal conditions this is not
    that much power.

    Back then I spent some time looking at how to do local 4-bit ASIDs per-core. That is, each core has its own 4-bit ASID counter and each process
    remembers which ASID it used last on each core.
    Each core's TLB retains entries for up to the last 16 processes to run threads on that core, and each core recycles its ASID independently.
    The 4-bit size was chosen based on the assumption that the odds are small
    of a TLB containing useful entries beyond the last 16 processes to run.

    My guess is that all OS threads {interrupts, exceptions, OS threads} used
    the same ASID (likely 0000), so you only really have 15 application ASIDs.

    This eliminates many of the global ASID shoot-downs, but 4 bits is too small as it would recycle too quickly, causing too frequent local ASID invalidates.

    Especially when one has HyperVisor, Host OS, Guest OS ... with each Guest OS thinking that it has its own set of ASIDs.

    In my system, there are 4 sets of 16-bit ASIDs active at all times: an ASID
    for each of {HV, HOS, GOS, App}.

    Combining all the ideas: a 16+ bit ASID, locally assigned per core,
    with a batch TLB ASID invalidate mechanism using a bit mask for matching ASIDs, so I can invalidate all ids in just one of the groups I mentioned elsewhere: GGxx_xxxx_xxxx_xxxx, where GG is the 2-bit group number.


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Paul Clayton@paaronclayton@gmail.com to comp.arch on Sun Feb 8 15:26:09 2026
    From Newsgroup: comp.arch

    On 2/3/26 1:03 AM, Robert Finch wrote:
    After getting bogged down testing the hash table, I decided to
    shelve it for now. And...

    I re-wrote Qupls4 TLB for hierarchical page tables. It is much
    cleaner code now. The TLB is taking up only about 600 LUTs and 8
    BRAMs now. It is also simpler, being only a single-level TLB.
    Three copies of the TLB would be needed.

    Thinking about adding caching of translation steps to avoid main
    memory access.

    If you use huge pages where page table nodes can be larger
    pages, then you might consider merging the page directory entry
    and huge PTE cache. This wastes N bits (10 bits for 32-bit PTEs
    and 4 KiB pages) for the PDE unless huge pages use these bits
    for additional TLB-managed features (or PDEs are constrained to
    an N-bit smaller address space, which is awkward). However, such
    allows entries to be used for both kinds of data (since both
    have the same index:tag bits). (If I recall correctly, x86
    allowed 4 MiB pages to use a 42-bit address space by using the
    bits that would otherwise be zero due to 4 MiB huge page
    alignment.) (I think this sharing works better for a modest-sized
    L2 huge page TLB/PDE cache.)

    This sharing saves a few tag bits compared with hash/rehash TLB
    for multiple page sizes at the cost of fixed large region
    caching size (though a hash/rehash TLB could repurpose the 2N
    bits - N from the tag and N from the translation - for huge
    pages).

    In theory, with rehashing the reduced tag entries could be used
    for base-sized pages that can be sign extended for the missing
    bits. Such might be useful as a victim cache, but I suspect such
    a feature would not be worthwhile.

    With PDE caching, it also becomes possible to use André Seznec's
    "Don't use the page number but a pointer to it" for TLB tags to
    compress the tags. (The PDE cache already holds the top virtual
    address bits [and any ASID], so the small page TLB entries can
    use a pointer to the PDE cache entry to compress the TLB entry.
    Unlike Seznec's proposal, this requires reading an extra table,
    though if the PDE cache used for this compression is also a
    large page TLB then parallel access might be desired anyway.)

    Replacement decisions might be more complicated with support for
    different entry types.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Paul Clayton@paaronclayton@gmail.com to comp.arch on Sun Feb 8 21:28:14 2026
    From Newsgroup: comp.arch

    On 2/8/26 5:57 PM, BGB wrote:
    On 2/5/2026 11:23 PM, Kent Dickey wrote:
    In article <1770238707-5857@newsgrouper.org>,
    [snip]
    I think the 32-entry fully associative is the primary TLB.
    The 1024 entry is a secondary TLB. Hits in the primary TLB
    cause no stalls at all. Misses in the primary probably lead
    to stalls (but it is possible the 6 cycle latency can be
    hidden sometimes), so it costs performance.

    If the primary TLB (the one with no stalls) is only 4 entries,
    then we can easily calculate a pretty substantial stall penalty
    for most programs since it will pay the 6 cycle stall a lot
    (and that's for a L2 TLB hit). So we make the L1 TLB as big as
    possible, but it cannot be made too large or it's too slow, so
    it ends up between 32 and 128 entries (256 tops).


    Full associativity isn't worth it.

    The gains from more associativity seem to max out at around
    8-way, and 32 pages isn't nearly enough coverage (very high TLB
    miss rates would be unavoidable).

    Full associativity is not just about conflict misses. Such also
    facilitates supporting multiple page sizes in the same TLB.
    Different page sizes naturally use different indexing bits, so
    having zero indexing bits means that any page can occupy any entry.

    André Seznec proposed a way to use (overlaid) skewed
    associativity to support multiple page sizes ("Concurrent
    Support of Multiple Page Sizes On a Skewed Associative TLB",
    2003). This has the limitation of requiring shared indexing bits
    so that the different page size probes (done in parallel) can be
    mapped to different banks. This also requires accessing more
    entries with limited associativity (the skewed associative
    conflict avoidance reduces the miss rate that this would otherwise
    cause as well as balances utilization among page sizes).

    AMD used full associativity to support coalescing of 4KiB pages
    into 16KiB (32KiB in Zen+ and earlier) TLB entries. (An ordinary
    8-way associative TLB indexed for 16KiB "pages" would map all
    uncoalescable 4KiB pages in a 16KiB chunk to the same set,
    increasing conflict misses.) It might have been possible to
    index by the shared bits of coalesced and base page sizes and
    use a CAM-based predictor to choose whether to index the rest by
    base page or by coalesced page, but I suspect such would not be
    that much smaller than using full associativity, and likely
    slower and more power consuming, as well as vulnerable to false
    conflicts from the predictor limiting which PTEs can share a set.
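    A sketch of the coalescing condition (illustrative only; AMD's actual
    detection logic is not public as far as I know):

```python
# Four aligned, contiguous 4 KiB translations can coalesce into one
# 16 KiB TLB entry when virtual and physical pages line up: the VPNs
# form an aligned group of 4 and each maps to the matching PPN offset.
def coalescable(mappings, base_vpn):
    # mappings: dict VPN -> PPN; base_vpn must be 16 KiB aligned.
    if base_vpn % 4 != 0:
        return False
    base_ppn = mappings.get(base_vpn)
    if base_ppn is None or base_ppn % 4 != 0:
        return False
    return all(mappings.get(base_vpn + i) == base_ppn + i for i in range(4))
```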

    Full associativity may also have been important for Itanium 2's
    prevalidated cache, where each cache tag held a one-hot bit
    vector indicating which TLB entry applied to it. This allowed faster
    tag check and flash invalidation of (write-through) cache
    entries when a TLB entry was evicted. The number of bits in the
    cache tag limits the TLB capacity, which makes associativity
    more important, and TLB misses are more expensive since they
    might invalidate useful cache blocks. (Write through may have
    been desirable for reliability or been a consequence of seeking
    a low latency mechanism for L1 accesses.)

    Of course, the tradeoffs are different for an FPGA implementation.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Paul Clayton@paaronclayton@gmail.com to comp.arch on Sun Feb 8 21:55:29 2026
    From Newsgroup: comp.arch

    On 2/6/26 9:29 AM, EricP wrote:
    Anton Ertl wrote:
    Robert Finch <robfi680@gmail.com> writes:
    On 2026-02-01 12:49 p.m., EricP wrote:
    Simpler to have create-process assign a new ASID by incrementing
    an unsigned up counter that starts at 1. When you hit its max then
    reset to 1 and clear the whole IVT for all non-zero ASID entries
    in one pass.
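    A sketch of that simple scheme (names illustrative; the flush stands in
    for clearing all non-zero ASID entries from the IVT/TLB):

```python
# Simple ASID allocation: a wrapping counter starting at 1; ASID 0 is
# reserved for the OS. On hitting the max, all non-zero ASID mappings are
# flushed in one pass and processes re-fault their translations in.
ASID_MAX = (1 << 16) - 1

class SimpleAsid:
    def __init__(self):
        self.counter = 0
        self.flushes = 0

    def alloc(self, tlb):
        if self.counter >= ASID_MAX:
            self.counter = 0
            tlb.clear()        # wipe all non-zero ASID entries
            self.flushes += 1
        self.counter += 1
        return self.counter
```

    With enough ASID bits the flush is rare, which is the point Anton makes
    below.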


    A garbage collection approach may work for removed entries.
    It might be a complicated state machine though to do in HW.
    It could look for invalid entries or ones with defunct ASIDs.

    AFAIK the approach outlined by EricP has been used in commercial
    hardware. It's simple, and if there are enough ASIDs, cheap (because
    resets are rare). So better add more bits to the ASIDs than
    introduce complicated machinery for reclaiming ASIDs.

    - anton

    The method I described before was a bit brute force, as it looks
    like
    it has to stall the whole system while it erases the old ASIDs.
    It is inefficient as it erases all ASIDs including ones recently
    assigned.

    One issue with invalidating ASIDs is that while the L1-TLB may be
    fully associative, L2-TLBs are often set associative.
    While a full assoc TLB can invalidate multiple matching entries in
    1 clock, the set assoc TLB requires a scan doing a read-compare-write
    operation on all the ways in each set, taking 2 or more clocks per set.
    So an L2-TLB with 512 sets could take 1024 clocks or more to complete
    an ASID shoot-down.

    This would also cause all running processes to have a flurry of
    TLB misses
    as they reload the entries just purged.

    I would want something smoother and continuous, that usually doesn't
    need to stall the whole system, and only recycles the oldest ASIDs.

    Something like: split the ASIDs into 4 groups using the 2 high bits.
    As the ASID counter increments from group N to group N+1, it
    initiates an invalidate of group N+2, such that it retains the
    recent ASIDs in groups N and N-1. And this is circular, so when it
    starts assigning in G3 it initiates the invalidate of G0. The
    invalidate is also asynchronous, so the system can continue to run
    and assign ASIDs while it invalidates the next group ahead (there
    are some interlocks waiting for ACKs).

    e.g. Say the ASID counter is 7 bits, so 4 groups of 32 IDs,
    called G0..G3.
    Say the current assignment ASID counter = 63, at the top of G1.
    When it increments to 64, the first ID in G2, it begins
    invalidating all group G3 ids in the range 96..127. Any processes
    currently using ASIDs in the G3 range have their ASID invalidated,
    and when they next run they will be assigned a new id in the
    current assignment group. All processes in groups G0 and G1 are
    unaffected by this ASID recycling. The core broadcasts an IPI
    (interprocessor interrupt) to all other cores to shoot down all
    group G3 ASIDs.

    One could also use a one-hot bit encoding with shared write
    ports (flash invalidation). Four groups would add two bits of
    overhead (plus whatever overhead is associated with the shared
    write port).

    I suspect pre-scrubbing via probing would be preferred.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Mon Feb 9 00:11:45 2026
    From Newsgroup: comp.arch

    On 2026-02-08 9:28 p.m., Paul Clayton wrote:
    On 2/8/26 5:57 PM, BGB wrote:
    On 2/5/2026 11:23 PM, Kent Dickey wrote:
    In article <1770238707-5857@newsgrouper.org>,
    [snip]
    I think the 32-entry fully associative is the primary TLB. The 1024
    entry is a secondary TLB. Hits in the primary TLB cause no stalls at
    all. Misses in the primary probably lead to stalls (but it is
    possible the 6 cycle latency can be hidden sometimes), so it costs
    performance.

    If the primary TLB (the one with no stalls) is only 4 entries, then
    we can easily calculate a pretty substantial stall penalty for most
    programs since it will pay the 6 cycle stall a lot (and that's for a
    L2 TLB hit). So we make the L1 TLB as big as possible, but it cannot
    be made too large or it's too slow, so it ends up between 32 and 128
    entries (256 tops).


    Full associativity isn't worth it.

    The gains from more associativity seem to max out at around 8-way, and
    32 pages isn't nearly enough coverage (very high TLB miss rates would
    be unavoidable).

    Full associativity is not just about conflict misses. Such also
    facilitates supporting multiple page sizes in the same TLB. Different
    page sizes naturally use different indexing bits, so having zero
    indexing bits means that any page can occupy any entry.

    André Seznec proposed a way to use (overlaid) skewed associativity
    to support multiple page sizes ("Concurrent Support of Multiple
    Page Sizes On a Skewed Associative TLB", 2003). This has the
    limitation of requiring shared indexing bits so that the different
    page size probes (done in parallel) can be mapped to different
    banks. It also requires accessing more entries with limited
    associativity (the skewed associative conflict avoidance reduces
    the miss rate that this would otherwise cause, as well as balancing
    utilization among page sizes).
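
    A rough sketch of the idea (the per-way hash functions here are
    invented for illustration and are not Seznec's actual skewing
    functions): each way indexes with a different hash of the bits above
    the largest supported page offset, so those bits are shared by every
    overlaid page size.

```python
# Toy skewed-associative TLB supporting overlaid page sizes. Each way
# uses a different index hash, so addresses that collide in one way
# usually do not collide in another. Index bits are taken above the
# largest page offset so one probe works for every page size.

SETS = 64                  # sets per way
WAYS = 4
LARGEST_PAGE_BITS = 16     # 64 KiB: index bits sit above this offset

def way_index(vaddr, way):
    x = vaddr >> LARGEST_PAGE_BITS        # bits shared by all page sizes
    return ((x >> (way * 3)) ^ (x >> (5 + way))) % SETS  # illustrative skew

class SkewedTLB:
    def __init__(self):
        self.ways = [dict() for _ in range(WAYS)]  # set -> (vpn, size, ppn)

    def lookup(self, vaddr):
        for w in range(WAYS):
            e = self.ways[w].get(way_index(vaddr, w))
            if e is not None:
                vpn, size_bits, ppn = e
                if (vaddr >> size_bits) == vpn:    # tag match at entry's size
                    return ppn
        return None

    def insert(self, vaddr, size_bits, ppn, way):
        self.ways[way][way_index(vaddr, way)] = (vaddr >> size_bits,
                                                 size_bits, ppn)
```
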

    The article requires signing in with an IEEE membership. I used to
    have a student membership many years ago. I let it expire as I was
    not using it; this was before the internet was big. Every so often
    I contemplate getting one, but I am trying to keep costs down.

    I have gotten a pretty good imagination for how things work just by
    reading headlines or summaries.

    I have gone ahead and made a TLB that is only one level, though, to
    conserve resources. The TLB is four-way associative with 512
    entries (making it smaller does not reduce resource usage). I found
    the BRAM TLB could be made fast enough to support the CPU with
    single-cycle translations, which means having a faster L1 was not a
    benefit.

    AMD used full associativity to support coalescing of 4KiB pages
    into 16KiB (32KiB in Zen+ and earlier) TLB entries. (An ordinary
    8-way associative TLB indexed for 16KiB "pages" would map all
    uncoalescable 4KiB pages in a 16KiB chunk to the same set,
    increasing conflict misses.) It might have been possible to index
    by the shared bits of coalesced and base page sizes and use a
    CAM-based predictor to choose whether to index the rest by base
    page or by coalesced page, but I suspect such would not be that
    much smaller than using full associativity, and likely slower and
    more power-consuming, as well as vulnerable to false conflicts from
    the predictor limiting which PTEs can share a set.

    Full associativity may also have been important for Itanium 2's
    prevalidated cache, where each cache tag had a one-hot bit
    indicating which TLB entry applied to it. This allowed faster tag
    checks and flash invalidation of (write-through) cache entries when
    a TLB entry was evicted. The number of bits in the cache tag limits
    the TLB capacity, which makes associativity more important, and TLB
    misses are more expensive since they might invalidate useful cache
    blocks. (Write-through may have been desirable for reliability or
    been a consequence of seeking a low-latency mechanism for L1
    accesses.)

    Of course, the tradeoffs are different for an FPGA implementation.

    Yeah, CAMs are expensive in FPGAs. They are often two-cycle, too.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Mon Feb 9 15:44:26 2026
    From Newsgroup: comp.arch

    Robert Finch wrote:
    On 2026-02-08 9:28 p.m., Paul Clayton wrote:

    André Seznec proposed a way to use (overlaid) skewed associativity
    to support multiple page sizes ("Concurrent Support of Multiple
    Page Sizes On a Skewed Associative TLB", 2003). This has the
    limitation of requiring shared indexing bits so that the different
    page size probes (done in parallel) can be mapped to different
    banks. It also requires accessing more entries with limited
    associativity (the skewed associative conflict avoidance reduces
    the miss rate that this would otherwise cause, as well as balancing
    utilization among page sizes).

    The article requires signing in with an IEEE membership. I used to have
    a student membership many years ago. I let it expire as I was not using
    it, this was before the internet was big. Every so often I contemplate getting one, but I am trying to keep costs down.

    A bit of poking about finds his home page

    https://team.inria.fr/pacap/members/andre-seznec/

    which links to

    https://team.inria.fr/pacap/members/andre-seznec/cache-architecture-research-2/

    and to his paper

    Concurrent Support of Multiple Page Sizes on a Skewed Associative TLB, 2003 http://www.irisa.fr/caps/people/seznec/SKEWEDTLBperso.pdf


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Mon Feb 9 16:27:15 2026
    From Newsgroup: comp.arch

    On 2/9/2026 2:44 PM, EricP wrote:
    Robert Finch wrote:
    On 2026-02-08 9:28 p.m., Paul Clayton wrote:

    André Seznec proposed a way to use (overlaid) skewed associativity
    to support multiple page sizes ("Concurrent Support of Multiple
    Page Sizes On a Skewed Associative TLB", 2003). This has the
    limitation of requiring shared indexing bits so that the different
    page size probes (done in parallel) can be mapped to different
    banks. It also requires accessing more entries with limited
    associativity (the skewed associative conflict avoidance reduces
    the miss rate that this would otherwise cause, as well as balancing
    utilization among page sizes).
    The article requires signing in with an IEEE membership. I used to
    have a student membership many years ago. I let it expire as I was not
    using it, this was before the internet was big. Every so often I
    contemplate getting one, but I am trying to keep costs down.

    A bit of poking about finds his home page

    https://team.inria.fr/pacap/members/andre-seznec/

    which links to

    https://team.inria.fr/pacap/members/andre-seznec/cache-architecture- research-2/

    and to his paper

    Concurrent Support of Multiple Page Sizes on a Skewed Associative TLB, 2003 http://www.irisa.fr/caps/people/seznec/SKEWEDTLBperso.pdf


    My usual thing here was:
    In mixed page-size configurations, the MMU is set to the smallest in-use
    page size;
    When set to larger page sizes, a few of the bits that were low-order
    bits with a smaller page size change to being high-order bits with a
    larger page size (all other bits remaining in the same place).


    So, say:
    If using both 4K and 16K pages, MMU is set for 4K;
    But, can set it for 16K if no 4K pages are being used;
    Or, 64K if not 4K or 16K pages are being used.

    With normal page tables, there is a defined "basic" page size and any
    other page sizes would exist as some multiple of this size. There is a
    slight inefficiency when dealing with larger pages with a smaller basic
    page size, but this was a reasonable tradeoff.
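
    The offset/VPN split at each step can be illustrated in a couple of
    lines (a minimal sketch, not BGB's code): each step up in page size
    moves bits between the page offset and the VPN, while every other
    bit keeps its position and meaning.

```python
# Illustration of the offset/VPN split at 4K, 16K, and 64K page sizes:
# only the bits at the boundary change roles; all other bits stay put.

def split(vaddr, page_bits):
    """Return (vpn, offset) for a page of 2**page_bits bytes."""
    return vaddr >> page_bits, vaddr & ((1 << page_bits) - 1)

addr = 0x0012_3456
vpn4k,  off4k  = split(addr, 12)   # 4 KiB pages
vpn16k, off16k = split(addr, 14)   # 16 KiB pages
vpn64k, off64k = split(addr, 16)   # 64 KiB pages
```
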

    If the page size changes between ASIDs, this isn't a huge issue as
    normally ASIDs can't see each other's pages anyways, except for global
    pages, but if using page-tables then any global pages shared between
    page tables would implicitly need the same page size.

    Seemed "obvious enough".



    Paper seems to have the idea of subdividing associativity if multiple
    page sizes are in use at the same time. I guess this works, if one has
    enough associativity.

    In this case 4 active page sizes would require at least an 8-way TLB to
    avoid "shooting oneself in the foot" though. Here, 2 active page sizes
    could be used with a 4-way TLB, where at least 2-way is needed (assuming modulo addressing) to maintain stability.

    This also works I guess.


    Ironically, a similar downside existed in my case when using Addr96
    modes, where this also came at the cost of reducing TLB associativity
    (though, this cost can be reduced when using primarily 48-bit
    addressing, *).

    *: Say, for example, if pages are either part of:
    The "null quadrant" (low 48 bit range of the 96-bit VAS);
    Part of the "current quadrant" (as defined by GBH);
    Normal 48-bit VAS memory ops are relative to GBH;
    Or, PCH for I$, but nominally PCH==GBH.
    Except maybe if you want to do a Harvard Addressing thing.
    In this case, can potentially key the high 48 bits to the ASID, so
    things still work so long as GBH is assumed to not change within a given
    ASID, and the page-miss handler can upload pages in the cheaper 48-bit
    VAS format rather than 96-bit VAS format (and sidestep the associativity cost).

    But, have mostly ended up not using the 96-bit VAS stuff, as it is kinda overkill.

    Even if, yeah, do have a chunk of encoding space being eaten by
    Load/Store ops that are mostly being unused and exist primarily to deal
    with 128-bit pointers. In this case would still need a 96-bit TLBE
    whenever accessing a page in some other part of the VAS.

    ...


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Mon Feb 9 20:56:34 2026
    From Newsgroup: comp.arch

    On 2026-02-09 5:27 p.m., BGB wrote:
    On 2/9/2026 2:44 PM, EricP wrote:
    Robert Finch wrote:
    On 2026-02-08 9:28 p.m., Paul Clayton wrote:

    André Seznec proposed a way to use (overlaid) skewed associativity
    to support multiple page sizes ("Concurrent Support of Multiple
    Page Sizes On a Skewed Associative TLB", 2003). This has the
    limitation of requiring shared indexing bits so that the different
    page size probes (done in parallel) can be mapped to different
    banks. It also requires accessing more entries with limited
    associativity (the skewed associative conflict avoidance reduces
    the miss rate that this would otherwise cause, as well as balancing
    utilization among page sizes).

    The article requires signing in with an IEEE membership. I used to
    have a student membership many years ago. I let it expire as I was
    not using it, this was before the internet was big. Every so often I
    contemplate getting one, but I am trying to keep costs down.

    A bit of poking about finds his home page

    https://team.inria.fr/pacap/members/andre-seznec/

    which links to

    https://team.inria.fr/pacap/members/andre-seznec/cache-architecture-
    research-2/

    and to his paper

    Concurrent Support of Multiple Page Sizes on a Skewed Associative TLB,
    2003
    http://www.irisa.fr/caps/people/seznec/SKEWEDTLBperso.pdf


    My usual thing here was:
    In mixed page-size configurations, the MMU is set to the smallest in-use page size;
    When set to larger page sizes, a few of the bits that were low-order
    bits with a smaller page size change to being high-order bits with a
    larger page size (all other bits remaining in the same place).


    So, say:
    If using both 4K and 16K pages, MMU is set for 4K;
    But, can set it for 16K if no 4K pages are being used;
    Or, 64K if not 4K or 16K pages are being used.

    With normal page tables, there is a defined "basic" page size and any
    other page sizes would exist as some multiple of this size. There is a slight inefficiency when dealing with larger pages with a smaller basic
    page size, but this was a reasonable tradeoff.

    If the page size changes between ASIDs, this isn't a huge issue as
    normally ASIDs can't see each other's pages anyways, except for global pages, but if using page-tables then any global pages shared between
    page tables would implicitly need the same page size.

    Seemed "obvious enough".



    Paper seems to have the idea of subdividing associativity if multiple
    page sizes are in use at the same time. I guess this works, if one has enough associativity.

    In this case 4 active page sizes would require at least an 8-way TLB to avoid "shooting oneself in the foot" though. Here, 2 active page sizes
    could be used with a 4-way TLB, where at least 2-way is needed (assuming modulo addressing) to maintain stability.

    This also works I guess.


    Ironically, a similar downside existed in my case when using Addr96
    modes, where this also came at the cost of reducing TLB associativity (though, this cost can be reduced when using primarily 48-bit
    addressing, *).

    *: Say, for example, if pages are either part of:
    The "null quadrant" (low 48 bit range of the 96-bit VAS);
    Part of the "current quadrant" (as defined by GBH);
    Normal 48-bit VAS memory ops are relative to GBH;
    Or, PCH for I$, but nominally PCH==GBH.
    Except maybe if you want to do a Harvard Addressing thing.
    In this case, can potentially key the high 48 bits to the ASID, so
    things still work so long as GBH is assumed to not change within a given ASID, and the page-miss handler can upload pages in the cheaper 48-bit
    VAS format rather than 96-bit VAS format (and sidestep the associativity cost).

    But, have mostly ended up not using the 96-bit VAS stuff, as it is kinda overkill.

    Even if, yeah, do have a chunk of encoding space being eaten by Load/
    Store ops that are mostly being unused and exist primarily to deal with 128-bit pointers. In this case would still need a 96-bit TLBE whenever accessing a page in some other part of the VAS.

    ...



    I found a slide presentation on the idea.

    I am wondering how an LRU system would work with the skewed entries.
    I assume the entries would be shifting between the ways for LRU.
    So, if there is a different-size page in one of the ways, would it
    still work?




    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Paul Clayton@paaronclayton@gmail.com to comp.arch on Mon Feb 9 21:26:40 2026
    From Newsgroup: comp.arch

    On 2/9/26 12:11 AM, Robert Finch wrote:
    On 2026-02-08 9:28 p.m., Paul Clayton wrote:
    [snip]
    André Seznec proposed a way to use (overlaid) skewed
    associativity to support multiple page sizes ("Concurrent
    Support of Multiple Page Sizes On a Skewed Associative TLB",
    2003).
    [snip]

    The article requires signing in with an IEEE membership. I used
    to have a student membership many years ago. I let it expire as
    I was not using it, this was before the internet was big. Every
    so often I contemplate getting one, but I am trying to keep
    costs down.

    Duck Duck Go provided the following url: https://www.irisa.fr/caps/people/seznec/SKEWEDTLBperso.pdf
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Feb 10 19:04:40 2026
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2026-02-09 5:27 p.m., BGB wrote:
    On 2/9/2026 2:44 PM, EricP wrote:
    ------------------------------------

    I found a slide presentation on the idea.

    I am wondering how a LRU system would work with the skewed entries.
    I assume the entries would be shifting between the ways for LRU.
    So, if there is a different-size page in one of the ways, would it
    still work?

    One would use the not-recently-used variant of LRU. One accumulates
    used bits on a per-set basis. When all used bits have been set, one
    clears all used bits.

    In practice it is easier to implement and almost as good.
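
    A small software model of this NRU policy (the structure and names
    are invented for illustration):

```python
# Model of the NRU scheme above: one used bit per entry in a set. When
# every bit becomes set (the AND-gate condition), all bits clear at
# once -- in this simplest variant, including the most recently used
# entry's bit. The victim on a miss is the first not-recently-used
# entry found.

class NRUSet:
    def __init__(self, ways):
        self.used = [False] * ways

    def touch(self, way):
        """Mark an entry recently used on a hit or fill."""
        self.used[way] = True
        if all(self.used):                   # the AND gate asserts
            self.used = [False] * len(self.used)

    def victim(self):
        """Find-first entry whose used bit is clear."""
        return self.used.index(False)
```
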
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Feb 10 14:17:27 2026
    From Newsgroup: comp.arch

    On 2/10/2026 1:04 PM, MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-02-09 5:27 p.m., BGB wrote:
    On 2/9/2026 2:44 PM, EricP wrote:
    ------------------------------------

    I found a slide presentation on the idea.

    I am wondering how a LRU system would work with the skewed entries.
    I assume the entries would be shifting between the ways for LRU. So, if
    there is a different size page in one of the ways would it still work?

    One would use the not-recently-used variant of LRU.
    One accumulated used bits on a per set basis.
    When all used-bits have been set, one clears
    all used bits.

    In practice it is easier to implement and almost as good.

    FWIW:
    I ended up not using an MRU or LRU for the TLB.

    While there is a loss of efficiency here (it always evicts the
    oldest page rather than the least-used), it does mean that the TLB
    miss handler can model the TLB deterministically (if the handler
    knows the length and width, it can be modeled exactly).

    Potentially a clever TLB miss handler could compensate by re-injecting
    the page that fell off the end if it is very likely to trigger another
    TLB miss (possibly tracking commonly-missed pages in an STF MRU or
    similar in software, and re-injecting pages that appear within this MRU
    at a higher rank than whatever is in the current last place).
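
    The determinism point can be sketched as follows (names and
    structure are illustrative): with FIFO replacement the miss handler
    can keep a shadow copy of a TLB set and predict every eviction
    without ever reading the TLB back.

```python
# Sketch: with FIFO (oldest-first) replacement, a software shadow copy
# of one TLB set stays exactly in sync with the hardware, so the miss
# handler always knows which entry falls off the end next.

from collections import deque

class FifoTLBSet:
    """One set of an N-way TLB with oldest-first replacement."""
    def __init__(self, ways):
        self.entries = deque(maxlen=ways)   # leftmost entry is oldest

    def fill(self, vpn, ppn):
        self.entries.append((vpn, ppn))     # deque drops the oldest itself

hw = FifoTLBSet(4)       # the hardware set
shadow = FifoTLBSet(4)   # the miss handler's software model of it

for vpn in range(6):     # six misses into a four-way set
    hw.fill(vpn, vpn + 100)
    shadow.fill(vpn, vpn + 100)
```
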


    Another theoretical possibility would be to allow the CPU to access
    the contents of the TLB via MMIO, but there are strong reasons not
    to do this. So, naive FIFO ended up being preferable. The other
    possibility being to have LRU, but then needing to fully evict a
    given set whenever a page's attributes are modified (such as a page
    swapped in or out of the pagefile).

    It is tradeoffs here...


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Feb 10 21:56:59 2026
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 2/10/2026 1:04 PM, MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-02-09 5:27 p.m., BGB wrote:
    On 2/9/2026 2:44 PM, EricP wrote:
    ------------------------------------

    I found a slide presentation on the idea.

    I am wondering how a LRU system would work with the skewed entries.
    I assume the entries would be shifting between the ways for LRU. So, if
    there is a different size page in one of the ways would it still work?

    One would use the not-recently-used variant of LRU.
    One accumulated used bits on a per set basis.
    When all used-bits have been set, one clears
    all used bits.

    In practice it is easier to implement and almost as good.

    FWIW:
    I ended up not using an MRU or LRU for the TLB.

    My NRU comprises a single S-R flip-flop per PTE, with all of them
    wired to a (high-fan-in) AND gate.

    When all NRU bits are set, the AND gate asserts, and all reset
    synchronously.

    When the TLB takes a miss, find-first logic chooses a unary index
    for the new PTE. Writing (or reading) a PTE sets its S-R FF.

    Simple and effective.

    While there is a loss of efficiency here (it always evicts the oldest
    page rather than the least-used), it does mean that the TLB miss handler
    can model the TLB deterministically (if the handler knows the length and width; it can be modeled exactly).

    Potentially a clever TLB miss handler could compensate by re-injecting
    the page that fell off the end if it is very likely to trigger another
    TLB miss (possibly tracking commonly-missed pages in an STF MRU or
    similar in software, and re-injecting pages that appear within this MRU
    at a higher rank than whatever is in the current last place).


    Another theoretical possibility would be to allow the CPU to access the contents of the TLB via MMIO, but there are strong reasons to not do
    this.

    MMI/O is potentially SLOW.

    So, naive FIFO ended up being preferable. The other possibility
    being to have LRU, but then needing to fully evict a given set whenever
    a pages' attributes are modified (such as a page swapped in or out of
    the pagefile).

    True LRU is BigO( n^3 ) in logic gates; whereas NRU is BigO( k ).


    It is tradeoffs here...

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue Feb 10 14:39:07 2026
    From Newsgroup: comp.arch

    On 2/10/2026 1:56 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 2/10/2026 1:04 PM, MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-02-09 5:27 p.m., BGB wrote:
    On 2/9/2026 2:44 PM, EricP wrote:
    ------------------------------------

    I found a slide presentation on the idea.

    I am wondering how a LRU system would work with the skewed entries.
    I assume the entries would be shifting between the ways for LRU. So, if there is a different size page in one of the ways would it still work?

    One would use the not-recently-used variant of LRU.
    One accumulated used bits on a per set basis.
    When all used-bits have been set, one clears
    all used bits.

    In practice it is easier to implement and almost as good.

    FWIW:
    I ended up not using an MRU or LRU for the TLB.

    My NRU is comprised of a single S-R flip-flop per PTE, with all of them
    wired to an (high-fan-in) AND gate.

    When all NRU-bits are set, AND gate asserts, and all reset synchronously.

    When the TLB takes a miss, find-first logic chooses a unary index for the new PTE. Writing (or reading) a PTE sets its S-R FF.

    Simple and effective.

    I presume NRU means "not recently used". But if, when all the NRU
    bits are set, all are reset, including the MRU entry's bit, might
    it get overwritten?
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Feb 11 01:42:01 2026
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 2/10/2026 1:56 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 2/10/2026 1:04 PM, MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-02-09 5:27 p.m., BGB wrote:
    On 2/9/2026 2:44 PM, EricP wrote:
    ------------------------------------

    I found a slide presentation on the idea.

    I am wondering how a LRU system would work with the skewed entries.
    I assume the entries would be shifting between the ways for LRU. So, if there is a different size page in one of the ways would it still work?
    One would use the not-recently-used variant of LRU.
    One accumulated used bits on a per set basis.
    When all used-bits have been set, one clears
    all used bits.

    In practice it is easier to implement and almost as good.

    FWIW:
    I ended up not using an MRU or LRU for the TLB.

    My NRU is comprised of a single S-R flip-flop per PTE, with all of them wired to an (high-fan-in) AND gate.

    When all NRU-bits are set, AND gate asserts, and all reset synchronously.

    When the TLB takes a miss, find-first logic chooses a unary index for the new PTE. Writing (or reading) a PTE sets its S-R FF.

    Simple and effective.

    I presume NRU means "not recently used". But if when all the NRU bits
    are set, all are reset, including the MRU entry's bit, might it get overwritten?

    The logic above is the simplest that performs "reasonably well". If
    certain timing requirements can be met, the new entry is marked
    recently-used; if not, you lose a little in event performance but
    gain in pipeline frequency.



    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Feb 11 00:01:50 2026
    From Newsgroup: comp.arch

    On 2026-02-10 4:56 p.m., MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 2/10/2026 1:04 PM, MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-02-09 5:27 p.m., BGB wrote:
    On 2/9/2026 2:44 PM, EricP wrote:
    ------------------------------------

    I found a slide presentation on the idea.

    I am wondering how a LRU system would work with the skewed entries.
    I assume the entries would be shifting between the ways for LRU. So, if there is a different size page in one of the ways would it still work?

    One would use the not-recently-used variant of LRU.
    One accumulated used bits on a per set basis.
    When all used-bits have been set, one clears
    all used bits.

    In practice it is easier to implement and almost as good.

    FWIW:
    I ended up not using an MRU or LRU for the TLB.

    My NRU is comprised of a single S-R flip-flop per PTE, with all of them
    wired to an (high-fan-in) AND gate.

    When all NRU-bits are set, AND gate asserts, and all reset synchronously.

    When the TLB takes a miss, find-first logic chooses a unary index for the new PTE. Writing (or reading) a PTE sets its S-R FF.

    Simple and effective.

    While there is a loss of efficiency here (it always evicts the oldest
    page rather than the least-used), it does mean that the TLB miss handler
    can model the TLB deterministically (if the handler knows the length and
    width; it can be modeled exactly).

    Potentially a clever TLB miss handler could compensate by re-injecting
    the page that fell off the end if it is very likely to trigger another
    TLB miss (possibly tracking commonly-missed pages in an STF MRU or
    similar in software, and re-injecting pages that appear within this MRU
    at a higher rank than whatever is in the current last place).


    Another theoretical possibility would be to allow the CPU to access the
    contents of the TLB via MMIO, but there are strong reasons to not do
    this.

    MMI/O is potentially SLOW.

    So, naive FIFO ended up being preferable. The other possibility
    being to have LRU, but then needing to fully evict a given set whenever
    a pages' attributes are modified (such as a page swapped in or out of
    the pagefile).

    True LRU is BigO( n^3 ) in logic gates; whereas NRU is BigO( k ).

    I did not realize it was so many gates for LRU. I thought it was
    mainly latches. The LRU in use for Qupls has the ways connected as
    a shift register. New entries are always added at way three for a
    four-way TLB; way zero gets evicted.

    If it is a lower gate count for NRU, the TLB may be switched -
    another re-write.

    The Qupls TLB is accessible as MMIO but this is probably
    unnecessary. It is also updated by an SCP now instead of a state
    machine. The SCP also handles Amiga Copper-style audio/video
    co-processing.




    It is tradeoffs here...


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Feb 11 01:02:06 2026
    From Newsgroup: comp.arch

    On 2026-02-11 12:01 a.m., Robert Finch wrote:
    On 2026-02-10 4:56 p.m., MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 2/10/2026 1:04 PM, MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-02-09 5:27 p.m., BGB wrote:
    On 2/9/2026 2:44 PM, EricP wrote:
    ------------------------------------

    I found a slide presentation on the idea.

    I am wondering how a LRU system would work with the skewed entries.
    I assume the entries would be shifting between the ways for LRU.
    So, if
    there is a different size page in one of the ways would it still work?
    One would use the not-recently-used variant of LRU.
    One accumulated used bits on a per set basis.
    When all used-bits have been set, one clears
    all used bits.

    In practice it is easier to implement and almost as good.

    FWIW:
    I ended up not using an MRU or LRU for the TLB.

    My NRU is comprised of a single S-R flip-flop per PTE, with all of them
    wired to an (high-fan-in) AND gate.

    When all NRU-bits are set, AND gate asserts, and all reset synchronously.

    When the TLB takes a miss, find-first logic chooses a unary index for the new
    PTE. Writing (or reading) a PTE sets its S-R FF.

    Simple and effective.
    While there is a loss of efficiency here (it always evicts the
    oldest page rather than the least-used), it does mean that the TLB
    miss handler can model the TLB deterministically (if the handler
    knows the length and width; it can be modeled exactly).

    Potentially a clever TLB miss handler could compensate by re-injecting
    the page that fell off the end if it is very likely to trigger another
    TLB miss (possibly tracking commonly-missed pages in an STF MRU or
    similar in software, and re-injecting pages that appear within this MRU
    at a higher rank than whatever is in the current last place).


    Another theoretical possibility would be to allow the CPU to access the
    contents of the TLB via MMIO, but there are strong reasons to not do
    this.

    MMI/O is potentially SLOW.

    So, naive FIFO ended up being preferable. The other possibility
    being to have LRU, but then needing to fully evict a given set whenever
    a pages' attributes are modified (such as a page swapped in or out of
    the pagefile).

    True LRU is BigO( n^3 ) in logic gates; whereas NRU is BigO( k ).

    I did not realize it was so many gates for LRU. I thought it was
    mainly latches. The LRU in use for Qupls has the ways connected as
    a shift. Then new entries are always added at way three for a
    four-way TLB. Way zero gets evicted.

    If it is lower gate count for NRU the TLB may be switched - another re- write.

    The Qupls TLB is accessible as MMIO but this is probably unnecessary. It
    is also updated by a SCP now instead of a state machine. The SCP also handles Amiga Copper style audio/video co-processing.


    I quickly modified the Qupls TLB for NRU updates and then compared
    NRU, LRU, and RANDOM:
    // TLB is four-way, 512 entries.
    // 3000 LUTs / 500 FFs / 8 BRAMs / 205 MHz RANDOM
    // 2900 LUTs / 500 FFs / 8 BRAMs / 205 MHz LRU
    // 2900 LUTs / 500 FFs / 8 BRAMs / 205 MHz NRU

    I did not get the results I was expecting. In my FPGA
    implementation they are all close to the same. The implementation
    has just a single-ported TLB, so there is some muxing for updates.
    It could be that all the other logic drowns out the difference.


    It is tradeoffs here...



    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Feb 11 18:15:25 2026
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2026-02-10 4:56 p.m., MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 2/10/2026 1:04 PM, MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-02-09 5:27 p.m., BGB wrote:
    On 2/9/2026 2:44 PM, EricP wrote:
    ------------------------------------

    I found a slide presentation on the idea.

    I am wondering how a LRU system would work with the skewed entries.
    I assume the entries would be shifting between the ways for LRU. So, if there is a different size page in one of the ways would it still work?
    One would use the not-recently-used variant of LRU.
    One accumulated used bits on a per set basis.
    When all used-bits have been set, one clears
    all used bits.

    In practice it is easier to implement and almost as good.

    FWIW:
    I ended up not using an MRU or LRU for the TLB.

    My NRU is comprised of a single S-R flip-flop per PTE, with all of them wired to an (high-fan-in) AND gate.

    When all NRU-bits are set, AND gate asserts, and all reset synchronously.

    When the TLB takes a miss, find-first logic chooses a unary index for the new PTE. Writing (or reading) a PTE sets its S-R FF.

    Simple and effective.

    While there is a loss of efficiency here (it always evicts the oldest
    page rather than the least-used), it does mean that the TLB miss handler can model the TLB deterministically (if the handler knows the length and width; it can be modeled exactly).

    Potentially a clever TLB miss handler could compensate by re-injecting
    the page that fell off the end if it is very likely to trigger another
    TLB miss (possibly tracking commonly-missed pages in an STF MRU or
    similar in software, and re-injecting pages that appear within this MRU
    at a higher rank than whatever is in the current last place).


    Another theoretical possibility would be to allow the CPU to access the
    contents of the TLB via MMIO, but there are strong reasons to not do
    this.

    MMI/O is potentially SLOW.

    So, naive FIFO ended up being preferable. The other possibility
    being to have LRU, but then needing to fully evict a given set whenever
    a pages' attributes are modified (such as a page swapped in or out of
    the pagefile).

    True LRU is BigO( n^3 ) in logic gates; whereas NRU is BigO( k ).

    I did not realize it was so many gates for LRU? I thought it was mainly
    latches. The LRU in use for Qupls has the ways connected as a shift
    register. New entries are always added at way three for a four-way TLB;
    way zero gets evicted.

    True LRU for 4-way set is 26-gates plus 4 D-type flip-flops.
    There are just too many states and too many state transitions.

    MRU is 1 logic gate plus 1 SR-flip-flop per entry and a FF0 across
    the whole set.

    You could never make true LRU for a 64-entry TLB, but MRU is too small to see.

    We (logic designers) dislike shift registers, but I think your
    use is clever.

    If it is lower gate count for NRU the TLB may be switched - another re-write.

    The Qupls TLB is accessible as MMIO but this is probably unnecessary. It
    is also updated by a SCP now instead of a state machine. The SCP also handles Amiga Copper style audio/video co-processing.




    It is tradeoffs here...


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Feb 11 18:06:11 2026
    From Newsgroup: comp.arch

    On 2/10/2026 3:56 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 2/10/2026 1:04 PM, MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-02-09 5:27 p.m., BGB wrote:
    On 2/9/2026 2:44 PM, EricP wrote:
    ------------------------------------

    I found a slide presentation on the idea.

    I am wondering how a LRU system would work with the skewed entries.
    I assume the entries would be shifting between the ways for LRU. So, if
    there is a different size page in one of the ways would it still work?

    One would use the not-recently-used variant of LRU.
    One accumulated used bits on a per set basis.
    When all used-bits have been set, one clears
    all used bits.

    In practice it is easier to implement and almost as good.

    FWIW:
    I ended up not using an MRU or LRU for the TLB.

    My NRU is comprised of a single S-R flip-flop per PTE, with all of them
    wired to an (high-fan-in) AND gate.

    When all NRU-bits are set, AND gate asserts, and all reset synchronously.

    When TLB takes a miss, Find-First logic chooses a unary index for the new PTE. Writing (or reading) a PTE sets its S-R FF.

    Simple and effective.



    All pose a similar issue though in that they introduce behavior that is
    not visible to the miss handler, and thus the exact state of the TLB
    can't be modeled in software. Well, unless maybe there were some good
    way to estimate this in SW and with a LDTLB which encoded which way to
    load the TLBE into.

    LDTLB: Load register pair into TLB, push-back by 1.
    LDTLB{1/2/3/4}: Load into 1st/2nd/3rd/4th way of TLB.


    Then again, in this case, NRU state would be 1024 bits, and could
    potentially be made accessible via a special mechanism (say, an LDNRU instruction). Would still be too bulky in this case to route it through
    CPUID, and wouldn't want to get the MMIO mechanism involved.


    Well, maybe could special case-it:
    MMIO requests to the TLB need not make it all the way to the MMIO bus,
    the TLB could just see the request passing by and be like "hey, that
    one's for me" and handle it directly as it passes by (a similar
    mechanism having already been used for LDTLB, which sort of reaches the
    TLB like a specially malformed memory store).


    If going this route, could maybe make sense to allow an access per slot,
    where one can fetch accessed (A) and Dirty (D) bits, possibly along with
    any other state that is being tracked (probably all 0 for now).


    While there is a loss of efficiency here (it always evicts the oldest
    page rather than the least-used), it does mean that the TLB miss handler
    can model the TLB deterministically (if the handler knows the length and
    width; it can be modeled exactly).

    Potentially a clever TLB miss handler could compensate by re-injecting
    the page that fell off the end if it is very likely to trigger another
    TLB miss (possibly tracking commonly-missed pages in an STF MRU or
    similar in software, and re-injecting pages that appear within this MRU
    at a higher rank than whatever is in the current last place).


    Another theoretical possibility would be to allow the CPU to access the
    contents of the TLB via MMIO, but there are strong reasons to not do
    this.

    MMI/O is potentially SLOW.


    Beyond this... If we are already paying the cost of an exception handler
    for TLB Miss, then some extra MMIO isn't likely a huge issue.

    Bigger issue is that it requires a special mechanism to do so, which
    could risk an undesirable level of resource cost impact to the TLB. This
    could be reduced if the access pattern for the TLB matches that of its
    normal operation though.


    If I were to design such an MMIO interface, maybe:
    ( 2:0): Always 0, 64-bit MMIO only
    ( 4:3): Word Selector (Provision for 256-bit TLBEs; Currently 128-bit)
    ( 7:5): TLB Way, make provision for 8-way TLB
    (17:8): TLB Slot, making provision for 1024x TLB
    (27:18): Constant Pattern, TBD
    (47:28): FFFFF, Lower-range MMIO Space

    Some quick/dirty mental mathing implies a base address of probably:
    FFFF_FE000000..FFFF_FE03FFFF

    Might need more address range if I want additional metadata, or
    provision for a bigger TLB.

    Or similar...
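    As a sanity check on the field layout and the mental math, the proposed
    address encoding can be sketched as below. The 27:18 constant pattern is
    TBD in the proposal; 0x380 is my guess, chosen only because it makes the
    base come out to FFFF_FE000000. Nothing here is an implemented interface:

```c
#include <stdint.h>
#include <assert.h>

/* Guessed value for the TBD constant pattern in bits 27:18. */
#define TLB_PAT 0x380ull

/* Build an MMIO address for one 64-bit word of one TLBE. */
static uint64_t tlb_mmio_addr(unsigned word, unsigned way, unsigned slot)
{
    return ((uint64_t)0xFFFFFull << 28) |       /* 47:28: lower MMIO space */
           (TLB_PAT << 18)              |       /* 27:18: constant pattern */
           ((uint64_t)(slot & 0x3FF) << 8) |    /* 17:8 : slot (1024x)     */
           ((uint64_t)(way & 7) << 5)   |       /*  7:5 : way (8-way max)  */
           ((uint64_t)(word & 3) << 3);         /*  4:3 : word; 2:0 zero   */
}
```

    The highest encodable address then lands just inside the
    FFFF_FE000000..FFFF_FE03FFFF range quoted above.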


    Can note that SH-4 had an MMIO interface for the TLB, but initially I
    didn't want to go this way.


    So, naive FIFO ended up being preferable. The other possibility
    being to have LRU, but then needing to fully evict a given set whenever
    a pages' attributes are modified (such as a page swapped in or out of
    the pagefile).

    True LRU is BigO( n^3 ) in logic gates; whereas NRU is BigO( k ).


    oK.

    I guess I can note that my use of LRU/MRU was most often based on STF,
    in a naive case:
    Access 3: Swap 3 and 2
    Access 2: Swap 2 and 1
    Access 1: Swap 1 and 0
    Access 0: No Change

    With insertion adding to the front and pushing everything back.

    Some more advanced ranking would involve access counts and swapping
    based on relative access frequency (to give a "true ranking order"), but
    I haven't usually done this (usually not worth the added cost vs the approximate answer that the STF approach gives).
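    The naive swap scheme above, as a behavioral sketch (4 ways, with
    `rank[0]` as the front of the ranking; names are mine):

```c
#include <assert.h>

#define WAYS 4

/* On an access, the hit entry swaps one step toward the front
 * (Swap-Towards-Front); an access at position 0 is a no-op. */
static void stf_access(int rank[WAYS], int hit_pos) {
    if (hit_pos > 0) {
        int t = rank[hit_pos - 1];
        rank[hit_pos - 1] = rank[hit_pos];
        rank[hit_pos] = t;
    }
}

/* Insertion: new entry goes to the front, everything is pushed back,
 * and the last-place entry is evicted (returned). */
static int stf_insert(int rank[WAYS], int new_way) {
    int evicted = rank[WAYS - 1];
    for (int i = WAYS - 1; i > 0; i--)
        rank[i] = rank[i - 1];
    rank[0] = new_way;
    return evicted;
}
```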



    IIRC I had used a more convoluted pattern to deal with Addr96 though, so (uploading an Addr48 TLBE):
    1:0 -> 3:2
    2 -> 1 (if it was a 48-bit TLBE)
    New -> 0
    1 -> Discarded

    With the Addr96 case being:
    1:0 -> 3:2
    New -> 1:0

    Rather than a more naive pattern for Addr48:
    3 -> Discarded
    2 -> 3
    1 -> 2
    0 -> 1
    New -> 0

    The more convoluted pattern mostly to deal with 96-bit addressing as
    paired TLBEs.




    Well, sort of related to a recent experiment with trying to add
    networking support to a 3D engine of mine (my BT3 engine), and ended up instead trying to use behavioral determinism to sync mobs rather than
    explicit messages.

    The code was originally written to use random numbers, but then noted I
    could use a 5D hash (X/Y/Z, Time, Aux) instead of a more traditional
    RNG. Basically works, and is easier to make deterministic in a world where
    what things are loaded in is variable (the Doom approach of using a globally-synchronized RNG would be N/A in this case).


    Though, for something like a mob, maybe difficult to imagine. One could imagine the mob as a sort of particle floating along with Brownian
    motion in a space where each spot along its path already has a vector
    waiting to push it along as soon as it arrives. If you can already
    determine the vectors, then starting from a known initial state it
    should be possible to step forwards until the current time and end up
    with everything in the same positions without needing any
    synchronization messages to be sent between the clients.


    Seemed kinda clever, but apparently isn't all that atypical.
    Grok claimed that this same basic approach was fairly common in RTS
    games though (eg, like StarCraft and WarCraft and similar), with some
    other ideas I was floating here being more typical of MMOs.


    Though, terrain generation in my 3D engines had already tended to
    use spatial hashing.

    Say, for example, in my current 3D engine for things like height-map:
    Fill a lookup table with random numbers generated from a PRNG (256 entries);
    Hash the X/Y or X/Y/Z coords ("(((X*P1+Y*P2+Z*P3)*P4)>>SH)&255");
    Select an index from the lookup table, and repeat with adjusted indices
    for adjacent entries;
    Interpolate between these.
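    A minimal sketch of the steps above (the primes and shift amount are
    placeholders for whatever P1..P4/SH actually are, and the table fill is
    a stand-in for the real PRNG):

```c
#include <stdint.h>
#include <assert.h>

static uint8_t lut[256];            /* filled once from a PRNG at init */

static void lut_init(uint32_t seed) {
    for (int i = 0; i < 256; i++) { /* tiny LCG standing in for the PRNG */
        seed = seed * 1664525u + 1013904223u;
        lut[i] = (uint8_t)(seed >> 24);
    }
}

/* Spatial hash, shaped like "(((X*P1+Y*P2+Z*P3)*P4)>>SH)&255". */
static int cell_hash(int x, int y, int z) {
    uint32_t h = ((uint32_t)x * 73856093u +
                  (uint32_t)y * 19349663u +
                  (uint32_t)z * 83492791u) * 2654435761u;
    return (h >> 16) & 255;
}

/* Height at (cx + fx/256, cy + fy/256): hash the four surrounding
 * lattice points, look each up, then bilinearly interpolate. */
static int height_at(int cx, int cy, int fx, int fy) {
    int h00 = lut[cell_hash(cx,     cy,     0)];
    int h10 = lut[cell_hash(cx + 1, cy,     0)];
    int h01 = lut[cell_hash(cx,     cy + 1, 0)];
    int h11 = lut[cell_hash(cx + 1, cy + 1, 0)];
    int top = h00 + (((h10 - h00) * fx) >> 8);
    int bot = h01 + (((h11 - h01) * fx) >> 8);
    return top + (((bot - top) * fy) >> 8);
}
```

    The important property is that the same (seed, coordinate) pair always
    yields the same height, which is what makes the whole scheme usable for
    deterministic world generation.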

    In this case, plain spatial hashing RNG had already been used for things
    like placing trees and similar (where the whole process needs to be deterministic across the space).

    Adding a time aspect and the potential of player interactions does
    complicate things, as any direct interaction between a player and a mob effectively breaks determinism (and a replay would require actively
    recording and replaying player movements and actions as part of the simulation, but possibly the server could keep track of this, where
    every player is for every tick, and trigger a partial re-sim in the case
    of known determinism breakages). Either this, or the server itself plays
    the sim forward (from the last reference-state) with the players added
    in, and then makes a new snapshot post interaction which is then sent to
    any relevant clients (as the new reference state).

    Well, along with imposing some behavioral tweaks, like if no one is sufficiently nearby then the region borders will behave as invisible walls.

    Say, for example, distance from nearest player:
    0-47 meters: Full behavior;
    48-95: Limiting advanced behaviors, region borders become walls;
    96+: Mobs stop moving and AI behavior is disabled;
    ~ 128-192: Ticking stops entirely,
    Limbo state until it exceeds render distance and fully unloads.


    Can also use a similar sort of spacetime-hash RNG for things like
    block-ticks (such as dirt blocks turning into grass), which also
    shouldn't need explicit synchronization.

    Oddly, in some of this stuff, there do seem to be curious physics
    parallels though.

    ...



    It is tradeoffs here...


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Feb 12 02:16:13 2026
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 2/10/2026 3:56 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 2/10/2026 1:04 PM, MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-02-09 5:27 p.m., BGB wrote:
    On 2/9/2026 2:44 PM, EricP wrote:
    ------------------------------------

    I found a slide presentation on the idea.

    I am wondering how a LRU system would work with the skewed entries.
    I assume the entries would be shifting between the ways for LRU. So, if
    there is a different size page in one of the ways would it still work?
    One would use the not-recently-used variant of LRU.
    One accumulated used bits on a per set basis.
    When all used-bits have been set, one clears
    all used bits.

    In practice it is easier to implement and almost as good.

    FWIW:
    I ended up not using an MRU or LRU for the TLB.

    My NRU is comprised of a single S-R flip-flop per PTE, with all of them wired to an (high-fan-in) AND gate.

    When all NRU-bits are set, AND gate asserts, and all reset synchronously.

    When TLB takes a miss, Find-First logic chooses a unary index for the new PTE. Writing (or reading) a PTE sets its S-R FF.

    Simple and effective.



    All pose a similar issue though in that they introduce behavior that is
    not visible to the miss handler, and thus the exact state of the TLB
    can't be modeled in software. Well, unless maybe there were some good
    way to estimate this in SW and with a LDTLB which encoded which way to
    load the TLBE into.

    LDTLB: Load register pair into TLB, push-back by 1.
    LDTLB{1/2/3/4}; Load into 1st/2nd/3rd/4th way of TLB.

    MIPS (company not Stanford) had the TLB logic find the entry to be
    written (find first), so SW just hashed a big table, checked match
    and if so, inserted new PTE where HW said.

    IF you are going to do a SW TLB reloader, this is the way.

    Then again, in this case, NRU state would be 1024 bits, and could potentially be made accessible via a special mechanism (say, an LDNRU instruction). Would still be too bulky in this case to route it through CPUID, and wouldn't want to get the MMIO mechanism involved.


    Well, maybe could special case-it:
    MMIO requests to the TLB need not make it all the way to the MMIO bus,
    the TLB could just see the request passing by and be like "hey, that
    one's for me" and handle it directly as it passes by (a similar
    mechanism having already used for LDTLB, which sort of reach the TLB
    like specially malformed memory stores).

    In wide super-scalar machines all of the memory 'paths' need this logic,
    and sometimes ordering of MMI/O becomes a "little tricky". ---------------------
    MMI/O is potentially SLOW.


    Beyond this... If we are already paying the cost of a exception handler
    for TLB Miss, then some extra MMIO isn't likely a huge issue.

    MIPS TLB reloader was 17 cycles (when instructions were in cache)
    I have witnessed MMI/O that was slower than 1 µs !! Also the control
    register bus in Athlon/Opteron was SLOW, too:: so you can't always
    use control registers in lieu of MMI/O ...
    -----------------

    True LRU is BigO( n^3 ) in logic gates; whereas NRU is BigO( k ).


    oK.

    I guess I can note that my use of LRU/MRU was most often based on STF,
    in a naive case:
    Access 3: Swap 3 and 2
    Access 2: Swap 2 and 1
    Access 1: Swap 1 and 0
    Access 0: No Change

    SWAP moves the whole size of PTE !! whereas NRU sets 1 bit.
    Would never be allowed in a low-power implementation.
    ----------------
    Say, for example, in my current 3D engine for things like height-map:
    Fill a lookup table with random numbers generated from a PRNG (256 entries); Hash the X/Y or X/Y/Z coords ("(((X*P1+Y*P2+Z*P3)*P4)>>SH)&255");
    Select an index from the lookup table, and repeat with adjusted indices
    for adjacent entries;
    Interpolate between these.

    One of the cool things about HW hashing is bit reversal of the fields in
    the hash--it goes a long way to 'whiten' the noise signature.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Feb 12 16:17:50 2026
    From Newsgroup: comp.arch

    On 2/11/2026 8:16 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 2/10/2026 3:56 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 2/10/2026 1:04 PM, MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-02-09 5:27 p.m., BGB wrote:
    On 2/9/2026 2:44 PM, EricP wrote:
    ------------------------------------

    I found a slide presentation on the idea.

    I am wondering how a LRU system would work with the skewed entries.
    I assume the entries would be shifting between the ways for LRU. So, if
    there is a different size page in one of the ways would it still work?
    One would use the not-recently-used variant of LRU.
    One accumulated used bits on a per set basis.
    When all used-bits have been set, one clears
    all used bits.

    In practice it is easier to implement and almost as good.

    FWIW:
    I ended up not using an MRU or LRU for the TLB.

    My NRU is comprised of a single S-R flip-flop per PTE, with all of them
    wired to an (high-fan-in) AND gate.

    When all NRU-bits are set, AND gate asserts, and all reset synchronously.
    When TLB takes a miss, Find-First logic chooses a unary index for the new
    PTE. Writing (or reading) a PTE sets its S-R FF.

    Simple and effective.



    All pose a similar issue though in that they introduce behavior that is
    not visible to the miss handler, and thus the exact state of the TLB
    can't be modeled in software. Well, unless maybe there were some good
    way to estimate this in SW and with a LDTLB which encoded which way to
    load the TLBE into.

    LDTLB: Load register pair into TLB, push-back by 1.
    LDTLB{1/2/3/4}; Load into 1st/2nd/3rd/4th way of TLB.

    MIPS (company not Stanford) had the TLB logic find the entry to be
    written (find first), so SW just hashed a big table, checked match
    and if so, inserted new PTE where HW said.

    IF you are going to do a SW TLB reloader, this is the way.


    The LDTLB mechanism isn't that much different from the one in SH-4, but differs in a few ways:
    SH-4 used MMIO registers for the TLBE to load-in;
    BJX2 uses R1:R0 as fixed registers.

    Also different TLB shape:
    SH-4:
    64-entry fully associative;
    LDTLB pushes whole TLB back by 1 entry,
    adding new TLBE to the front of the list;
    With oldest entry then falling off the end.
    BJX2:
    256 x 4-way;
    Had experimented with swapping,
    but ended up not using it for sake of determinism.

    Both use an 0R instruction that serves the role of taking whatever TLBE
    is presented to it and shoving it into the TLB in the place it goes.


    One other practical difference was that it was possible to read TLB
    contents via MMIO on SH-4, but this had not been carried over into my projects. Initially, I was just sort of like "WTF?" and didn't see it as particularly useful. Though now I can at least see some possible
    use-cases, though maybe debatable if enough of a use case to actually
    justify adding something like this.




    Then again, in this case, NRU state would be 1024 bits, and could
    potentially be made accessible via a special mechanism (say, an LDNRU
    instruction). Would still be too bulky in this case to route it through
    CPUID, and wouldn't want to get the MMIO mechanism involved.


    Well, maybe could special case-it:
    MMIO requests to the TLB need not make it all the way to the MMIO bus,
    the TLB could just see the request passing by and be like "hey, that
    one's for me" and handle it directly as it passes by (a similar
    mechanism having already used for LDTLB, which sort of reach the TLB
    like specially malformed memory stores).

    In wide super-scalar machines all of the memory 'paths' need this logic,
    and sometimes ordering of MMI/O becomes a "little tricky". ---------------------

    Yeah.
    I haven't yet moved past 1 memory port, for the whole reason that the
    jump from 1 to 2 memory ports is steep.
    It is likely though that higher throughput would be possible in theory
    if 2 memory ports could be supported.


    MMI/O is potentially SLOW.


    Beyond this... If we are already paying the cost of an exception handler
    for TLB Miss, then some extra MMIO isn't likely a huge issue.

    MIPS TLB reloader was 17 cycles (when instructions were in cache)
    I have witnessed MMI/O that was slower than 1 µs !! Also the control
    register bus in Athlon/Opteron was SLOW, too:: so you can't always
    use control registers in lieu of MMI/O ...
    -----------------

    In my case, the TLB reload is far more than this (hundreds of cycles),
    but mostly infrequent enough to not be a huge issue.

    Though, part of this is because a 256x4 TLB tends to fit patterns a lot
    better than a 64-entry full-assoc TLB would have, as well as being *far*
    cheaper (uses some BRAMs rather than an absurd number of LUTs).



    True LRU is BigO( n^3 ) in logic gates; whereas NRU is BigO( k ).


    oK.

    I guess I can note that my use of LRU/MRU was most often based on STF,
    in a naive case:
    Access 3: Swap 3 and 2
    Access 2: Swap 2 and 1
    Access 1: Swap 1 and 0
    Access 0: No Change

    SWAP moves the whole size of PTE !! whereas NRU sets 1 bit.
    Would never be allowed in a low-power implementation.
    ----------------

    Possibly.

    As noted, I ended up not using it, in favor of always pushing entries front-to-back.


    Say, for example, in my current 3D engine for things like height-map:
    Fill a lookup table with random numbers generated from a PRNG (256 entries);
    Hash the X/Y or X/Y/Z coords ("(((X*P1+Y*P2+Z*P3)*P4)>>SH)&255");
    Select an index from the lookup table, and repeat with adjusted indices
    for adjacent entries;
    Interpolate between these.

    One of the cool things about HW hashing is bit reversal of the fields in the hash--it goes a long way to 'whiten' the noise signature.


    Yeah.

    In software, one is mostly just stuck with "multiply by primes" and XOR
    and similar.

    There is some "magic" that can be gained with Morton shuffles in
    hashing, but this needs special CPU instructions to have much advantage
    over multiply-by-primes or similar (Morton shuffle via lookup tables
    tending to have higher latency than by using a multiply).

    Shift+XOR can do some special things, but naively shifting and XOR'ing multiple values together tends to be worse than
    multiply-by-primes-and-add (so, worse from a hashing POV).

    Bit reversal would be nice, but alas...
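    Without a dedicated instruction, a 32-bit reversal in software at least
    comes down to a five-step shift/mask ladder (standard bit-twiddling,
    nothing BJX2-specific), which is still a good deal more work than one
    multiply:

```c
#include <stdint.h>
#include <assert.h>

/* Reverse the bits of a 32-bit word: swap adjacent bits, then pairs,
 * then nibbles, then bytes, then halfwords. */
static uint32_t bitrev32(uint32_t x) {
    x = ((x & 0x55555555u) <<  1) | ((x >>  1) & 0x55555555u);
    x = ((x & 0x33333333u) <<  2) | ((x >>  2) & 0x33333333u);
    x = ((x & 0x0F0F0F0Fu) <<  4) | ((x >>  4) & 0x0F0F0F0Fu);
    x = ((x & 0x00FF00FFu) <<  8) | ((x >>  8) & 0x00FF00FFu);
    return (x << 16) | (x >> 16);
}
```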


    But, yeah, with hashed indexing into a fixed-size lookup table, one can
    get a surprising variety of terrain height-maps. Well, maybe "sameness" existing on the macro scale, if someone looks at a lot of world seeds.


    There are some possible improvements to the algo, but don't really want
    to poke at it as changing the algo would break the existing terrain generation. Well, and I have dealt with Minecraft enough to know what
    this mess looks like (though, luckily has become much less of an issue
    in recent years, but in the past, it was common for new versions of the
    game to break terrain generation in ways which left big ugly seams in
    the world where the terrain in newly generated chunks doesn't match up
    with the older chunks, ...).




    One difference exists here between BT2 and BT3 engines:
    BT2 used the hash value directly, and interpolated between the hashed
    values;
    BT3 routes it through a lookup table that is filled from the starting
    seed (holding values nominally in the range of -1.0 to 1.0).

    Both differ from Perlin Noise which used a more complex process, IIRC:
    Normalized unit vectors;
    Interpolating between unit vectors;
    Taking the dot-product with another vector to get a scalar.

    Made sense to drop the vector math parts, as one can get basically the
    same effect by just interpolating linear values.


    The original BGBTech (BT1) engine had used something more like traditional Perlin noise.

    Also, like BT2/BT3 also technically had a wrapping world, but in this
    case the terrain generation didn't wrap cleanly so at the edge there
    would be an abrupt seam.



    So, say:

    BT1: World had signed coordinates.
    Theoretically wrapped, but world-seam issues were never addressed.
    Engine performance and memory use was too bad to allow for much travel
    (so initial terrain generation had big walls at 1km from the origin).
    Theoretically, if the world seam were reached, there would have been an
    abrupt mismatch in the terrain generation, with the player teleported to
    the opposite side once crossed.

    BT2:
    Also wrapped, using 20.12 fixed-point coords for many things;
    In this case, had (sorta) interpreted the coordinates as signed, but
    left signed/unsigned handling as ambiguous;
    Tested seam, clean wrap on seam actually worked.

    BT3:
    Wraps, but 16.8 (X/Y) and 9.7 (Z) in this case, mostly because I had
    used a packed 64-bit format for world coords (rather than a struct). Z
    was originally 8.8, but I expanded the max world height.
    Couldn't go to 10.6 though as this broke the ray-casting.

    So, alas, both more range and precision would be nice, but would need to rework the engine to go to doing all of this stuff with structs rather
    than bit-packed values. But, if I went to a struct here, could maybe add
    a world-scale option to go back to a 1024km cube for the world size.

    Original reason for the packed 64-bit coords though was to make it
    cheaper on the BJX2 core, but ironically in this case could also allow
    for packed "__int128" here (supported in BGBCC, but not in MSVC).

    All this does mean that the maximum world size in BT3 is ~ 64 km (vs
    1024 km in BT2).
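    A guess at what the packed 64-bit coordinate format might look like: the
    16.8 X/Y and 9.7 Z figures imply 24 + 24 + 16 = 64 bits, but the actual
    field order in BT3 isn't given, so the layout below is an assumption:

```c
#include <stdint.h>
#include <assert.h>

/* 16.8 fixed-point X and Y (24 bits each), 9.7 fixed-point Z (16 bits).
 * Field order (Z high, then Y, then X low) is my assumption. */
static uint64_t pack_xyz(uint32_t x, uint32_t y, uint32_t z) {
    return ((uint64_t)(z & 0xFFFFu)   << 48) |
           ((uint64_t)(y & 0xFFFFFFu) << 24) |
            (uint64_t)(x & 0xFFFFFFu);
}

static uint32_t unpack_x(uint64_t c) { return (uint32_t)(c & 0xFFFFFF); }
static uint32_t unpack_y(uint64_t c) { return (uint32_t)((c >> 24) & 0xFFFFFF); }
static uint32_t unpack_z(uint64_t c) { return (uint32_t)(c >> 48); }
```

    With 8 fraction bits, 1 unit = 1/256 of a meter, and the 16 integer bits
    of X/Y give the ~64 km wrap mentioned above.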


    But, alas, reason for BT3's existence was mostly because BT2 would have
    been too heavyweight to run on a 50 MHz BJX2 core. But, trying to scale
    it up (to give a better experience on a desktop PC use case) kinda
    conflicts with a lot of its original design goals.



    Also in BT3, the world coords are currently seen as nominally unsigned,
    with the spawn location right near the seam.

    For expanded world height, the BT3 engine is like the BT2 engine in that
    it works by stacking regions in the Z axis rather than by making regions taller. The BT2 engine could have supported fully cube-shaped worlds
    though, but the BT3 engine can't currently do this.


    All 3 engines differed from Minecraft though in that the world is seen
    as wrapping rather than an infinite plane.


    Also maybe ironically as well, in the BT2 engine, the original world
    type was not an open Earth-like surface world, but rather an endless 3D
    cave system. The handling of an "overworld" was itself a special case
    built "inside" the 3D cave system via the terrain generator.

    So, in this case, you weren't "outside" as a property of the world, but
    rather because of there being a "sky" block within a certain distance
    over the player's head. So, the terrain generator signaled "overworld"
    by lining the top of the generated regions with sky blocks (if one were
    to mine through the sky layer, they could potentially end up back in the caves).

    One other gimmick that the BT2 engine had that the BT3 engine lacks, was
    the ability to effectively rotate the world relative to the player (such
    as inverting gravity). But, it could have also allowed for spherical,
    hollow sphere, or cube/hollow cube worlds, ... In this case, the gravity direction was a property of the player (with the player's yaw/pitch/roll relative to the "local gravity" coordinate space).

    Could in theory bring this back, but would need to go back to
    struct-based coordinates to make use of this.


    Some other differences:
    BT2 used a 256x256x256 region size (16x16x16 chunks)
    BT3 uses a 128x128x128 region size (8x8x8 chunks)
    Reduction in region size was mostly a memory-saving thing.
    Does mean more loaded regions for a given draw-distance though.
    BT2 had used an STF+AdRice+LZ77 based compressor for chunks.
    BT3 used RP2 (byte oriented LZ, no native entropy coder)
    But recently has gained an optional entropy post stage,
    Used if it will get a sufficient size reduction to justify it.
    Current entropy post-encoders being STF+AdRice and Range Coding.
    STF+AdRice: Faster, but not as strong.
    Range Coding: Better compression, but very slow.
    Raw Bytes: Fastest option.
    Uses raw bytes in most cases:
    When a chunk LZ compresses to 200-300 bytes,
    entropy coding doesn't do much.
    It is mostly larger chunks (700-1K or more)
    where it "actually does something".

    Contrast:
    The original BGBTech / BT1 engine had used RLE, basically the same RLE
    scheme as Wolf3D and ROTT (3-byte RLEs: Escape, Length, Value; Escape
    chosen to be the least-used byte). However, RLE was not particularly
    effective here.



    Initially, the STF+AdRice post compressor was entirely ineffective, but
    I figured out that there was a tweak that could be made to the AdRice to
    gain around 10% or so for entropy (putting it at around 6% worse than
    the range coder), and thus "actually effective".

    Note as for the lack of Huffman:
    Static (table driven) Huffman is ineffective with small data.
    Space needed to encode a Huffman table eats any gains,
    even with Deflate's Huffman-encoded Huffman-tables trick.
    Also, setting up the tables is expensive (making it slow).
    Global static tables suck
    And, if not a perfect fit, are worse than AdRice.
    Vitter's Algorithm (Adaptive Huffman) is stupidly slow,
    manages to somehow make a Range Coder seem performant.



    Note that STF is part of the "secret sauce" here in that it allows
    AdRice to be made to act more like a Huffman coder.

    Where, in this case:
    STF: Swap Towards Front
    Typically to ((I*15)/16), best results IME (vs 7/8 or 31/32).
    MTF: Move to Front
    Symbol moves to front, shifting all other symbols back.
    Both slower and less effective than STF for this.
    SMTF: Swapping Move To Front.
    Typically behaves like STF for higher-probability symbols.
    Top 32/64/128 typically behaving like in STF.
    Rotates space and swaps symbol into new index 0 if low probability.
    More complex than plain STF, effectiveness depends on data type.

    One way to improve accuracy of STF is to add a counter, and only doing
    the swap if the counter for a symbol exceeds the count of the
    destination symbol. However, the usage of counters and conditional
    swapping has a fairly steep performance cost (though, still not quite as
    bad as using Range Coding).
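    The plain STF update (without the counters) can be sketched as an
    encoder-side transform. The (I*15)/16 swap target follows the text; the
    identity table init and linear search are mine, for illustration only:

```c
#include <assert.h>

static unsigned char stf_tab[256];

static void stf_init(void) {
    for (int i = 0; i < 256; i++)
        stf_tab[i] = (unsigned char)i;
}

/* Encode one byte: emit the symbol's current table index, then swap
 * the symbol toward the front, to index (i*15)/16. Frequent symbols
 * thus migrate to small indices, which AdRice codes cheaply. */
static int stf_encode(unsigned char sym) {
    int i = 0;
    while (stf_tab[i] != sym) i++;
    int j = (i * 15) / 16;
    unsigned char t = stf_tab[j];
    stf_tab[j] = stf_tab[i];
    stf_tab[i] = t;
    return i;
}
```

    The decoder runs the same table update in lockstep, so no side
    information is needed; only the emitted indices are entropy coded.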

    ...



    For textures:

    BT1: Had used BTJPG, or a tweaked version of the T.81 JPEG format,
    mostly tweaked to allow for optional lossless coding and alpha channel.
    The lossless mode worked by switching over to RCT and SERMS-RDCT vs
    YCbCr and DCT, but RDCT was slow (and only used if all quantization
    parameters were 1). Alpha channel was handled by embedding the alpha
    channel as a monochrome image within an APP11 tag.

    IIRC, the merit of the hacky JPEG variant was that it was typically
    still more space efficient and faster to decode than PNG (despite its conceptual simplicity, high performance PNG decoding is difficult, particularly if the goal is to transcode to DXT5 or BC7 or similar).


    BT2: Had used my BT4B format, which was natively color-cell based,
    moderately fast to decode, but inherently lossy and prone to artifacts.


    BT3: Initially used DDS. Had considered using a variant of my BT5B
    format, but by its design BT5B couldn't give better inherent Q/bpp than
    DDS. BT5B essentially followed in similar footsteps to RPZA and MS-CRAM.


    Ended up switching BT3 over to my UPIC format, which ironically was
    designed mostly as a cheaper alternative to T.81 JPEG, but following a
    similar general structure:
    8x8 transform blocks organized into 16x16 macroblocks with 4:2:0 or
    4:4:4 subsampling, though borrowing a few design tricks from the Theora
    codec as well.

    Like with BTJPG, was able to implement a fast transcoder to DXT1 (could
    maybe also use BC7H; but for now the BT3 engine is just using DXT1).


    BT2 had used exclusively sprite graphics.

    BT1 had used 3D models, mostly using the AC3D and Valve SMD formats.

    BT3 initially used sprites, but now also using 3D models, mostly in a
    custom BMD format, generally being generated from SCAD scripts ATM.
    Currently, SCAD lacks any good way to do textures or skeletal animation,
    but I lack any good/obvious way to add these onto the SCAD language in a
    way that wouldn't break compatibility with OpenSCAD.

    Could do something like "/*[[bone("LArm")]]*/" but this sucks...

    Could just do my own SCAD viewer, but, ... Well, forking the SCAD
    language dilutes the merit of using SCAD, though there would still be a
    common subset for untextured non-animated models.

    Idea for animation in this case is mostly to assign CSG solids (or
    other groups/unions) to bones on a skeleton, and then by moving the
    bone it moves the relevant part of the model.

    In this case, the animation loops would also be procedural, could
    possibly also move it over to representing these as an extension of the
    SCAD language.

    But, if going this route, could maybe also use the C preprocessor to
    allow including other SCAD files for reusable components (well, and
    the "already implicitly exists" extension of also supporting integer
    math, where OpenSCAD effectively lacks support for integer math
    operations; and C-style "for()" loops, as SCAD has its own very wonky
    for loops).



    For audio:
    Ironically, thus far, all 3 engines have ended up back at ADPCM.
    Thus far, have not found an audio format that shows clear advantage
    over using some variation of ADPCM for sound effects (typically IMA
    ADPCM; technically there is also "Microsoft ADPCM", but they kinda
    screwed it up).

    The 2nd place option is mostly A-Law waves (typically slightly better
    quality, but larger size). Where, 16-bit PCM is needlessly bulky, and
    8-bit PCM sounds kinda like crap (despite 11 kHz 8-bit PCM being the
    "gold standard" for 90s era games).

    Granted, 8 kHz IMA ADPCM isn't exactly peak audio quality either, but it
    is also like 3/8 the size of said 11kHz 8-bit PCM. Can jump to 16kHz
    when needed though, and still be smaller; or use 2-bit ADPCM when
    smaller sizes are needed.
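    For reference, the per-nibble decode of IMA ADPCM is small and well
    standardized; a minimal mono decoder sketch (the standard IMA
    step/index tables, not engine code from any of the BT engines):

```c
#include <stdint.h>

/* Standard IMA ADPCM step-size and index-adjust tables. Each 4-bit
 * nibble expands to one 16-bit sample. */
static const int16_t ima_step[89] = {
        7,     8,     9,    10,    11,    12,    13,    14,
       16,    17,    19,    21,    23,    25,    28,    31,
       34,    37,    41,    45,    50,    55,    60,    66,
       73,    80,    88,    97,   107,   118,   130,   143,
      157,   173,   190,   209,   230,   253,   279,   307,
      337,   371,   408,   449,   494,   544,   598,   658,
      724,   796,   876,   963,  1060,  1166,  1282,  1411,
     1552,  1707,  1878,  2066,  2272,  2499,  2749,  3024,
     3327,  3660,  4026,  4428,  4871,  5358,  5894,  6484,
     7132,  7845,  8630,  9493, 10442, 11487, 12635, 13899,
    15289, 16818, 18500, 20350, 22385, 24623, 27086, 29794,
    32767
};
static const int8_t ima_index_adj[8] = { -1, -1, -1, -1, 2, 4, 6, 8 };

typedef struct { int predictor; int index; } ImaState;

static int16_t ima_decode_nibble(ImaState *st, uint8_t nib) {
    int step = ima_step[st->index];
    int diff = step >> 3;              /* implicit step/8 rounding term */
    if (nib & 1) diff += step >> 2;
    if (nib & 2) diff += step >> 1;
    if (nib & 4) diff += step;
    if (nib & 8) diff = -diff;
    st->predictor += diff;
    if (st->predictor >  32767) st->predictor =  32767;
    if (st->predictor < -32768) st->predictor = -32768;
    st->index += ima_index_adj[nib & 7];
    if (st->index < 0)  st->index = 0;
    if (st->index > 88) st->index = 88;
    return (int16_t)st->predictor;
}
```

    At 4 bits/sample, 8 kHz mono comes to 8000 x 4 = 32 kbps, versus
    11025 x 8 = ~88 kbps for 11 kHz 8-bit PCM, i.e. roughly 3/8 the
    size, matching the figure above.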

    Seemingly, not much else competes well in 16-32 kbps territory.
    Haven't been able to push much lower and still get passable quality
    (8-10 kbps mostly still out of reach).


    Closest I got to something usable was, ironically, doing some
    statistical trickery in an ADPCM encoder and then using LZ + Range
    Coding on it. The idea is that the encoder can remember previously
    seen patterns and then try to fit the following audio to a previously
    seen pattern if it is a "close-enough" match as per an error
    heuristic (basically, trying to make the waveform more easily LZ
    compressible by doing something similar to LZ and replaying the prior
    pattern if it is close enough to a match).

    But, the ability to improve the LZ compression of ADPCM audio as an
    encoder feature isn't exactly a break from ADPCM. Still fell short of
    solidly breaking into "sub 10 kbps" territory.



    Well, and for asset packing:
    BT1: Tweaked ZIP
    BT2: BTPAK.
    BT3: WAD4.

    Where, BTPAK was similar to what stuff I needed to do with ZIP to make
    it "not suck" for a VFS. It had mostly used a Deflate based data
    compression scheme (mostly a variant of Deflate64 tweaked to further
    expand the size of the sliding window: Deflate had a 32K window;
    Deflate64 had 64K; I had expanded it to around 1MB IIRC, partly noting
    as Deflate's distance encoding did not use the full byte-range for
    distance codes, though IIRC did require modifying the headers, as the
    original encoding for the Huffman-coded-huffman-tables didn't allow the
    full byte range for the match distance table).

    WAD4 is a further simplification, and exists more as a logical expansion
    of the WAD2 format. WAD4 is mostly using LZ4 and RP2 compression. Wasn't
    using Deflate mostly as it was difficult to make Deflate decoding cheap
    (so it makes sense to, if using Deflate, to only use it if it can beat
    LZ4 or RP2 by some certain percentage).

    Say, for example:
    Only use Deflate if it is at least 20% smaller than LZ4 or RP2.
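    That selection rule is trivial to express; a hypothetical sketch
    (the enum names and function are mine, only the 20% margin is from
    the rule above):

```c
#include <stddef.h>

/* WAD4-style "only take Deflate if clearly smaller" rule: Deflate
 * decodes slowly, so demand at least a 20% size win over the fast
 * option (LZ4/RP2, or storing raw) before accepting it. */
enum Codec { CODEC_STORE, CODEC_LZ4_OR_RP2, CODEC_DEFLATE };

static enum Codec pick_codec(size_t raw, size_t lz_sz, size_t defl_sz) {
    size_t best_fast = (lz_sz < raw) ? lz_sz : raw;
    if (defl_sz * 5 <= best_fast * 4)   /* defl <= 0.8 * best_fast */
        return CODEC_DEFLATE;
    if (lz_sz < raw)
        return CODEC_LZ4_OR_RP2;
    return CODEC_STORE;
}
```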

    WAD4 was itself mostly to reduce overhead and complexity if compared
    with BTPAK.

    One major use-case for packing and compression being to accelerate
    loading, but this goal is defeated if the overhead of the
    decompression is higher than that of the bandwidth from an HDD or
    similar (so, say, on a desktop PC, one needs at least 300-500 MB/s
    from the LZ decoder to significantly beat the HDD; or around a GB/s
    if reading from an SSD).

    Values are lower on BJX2 Core + SDcard, but similar idea applies:
    LZ4 and RP2 can beat IO to the SDcard;
    Deflate, solidly, could not (very much slower than SDcard).


    Curiously, the claim was that Deflate/ZIP was originally considered as
    being "very fast", but then, what exactly did it replace?...


    Well, there was LZSS, but this kinda sucked in a different way:
    Injecting bytes to encode literal/match bits for next 8 items;
    If 8 items consumed, read another byte for 8 more bits.
    If bit was 0, literal byte;
    If bit was 1, read 16 bits,
    with 4 bit length and 12 bit offset.
    Match copying was modulo, which kinda sucked.

    Wouldn't imagine it being particularly slow, but granted, it sucks
    pretty hard if compared with LZ4 or similar.
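    A decoder for the scheme described above is only a few lines. This
    sketch assumes an Okumura-style 4 KB ring buffer, a +3 length bias,
    and low-bit-first flag bits; all of these conventions varied between
    real implementations.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* LZSS decode as described above: a flag byte covers the next 8 items;
 * bit 0 = literal byte, bit 1 = a 16-bit token with a 12-bit ring
 * offset and 4-bit length. Match copying is modulo the ring. */
#define LZSS_RING   4096
#define LZSS_MINLEN    3

static size_t lzss_decode(const uint8_t *src, size_t n,
                          uint8_t *dst, size_t dst_max) {
    uint8_t ring[LZSS_RING] = {0};
    size_t sp = 0, dp = 0;
    unsigned rp = 0;               /* ring write position (modulo) */
    unsigned flags = 0, nbits = 0;

    while (sp < n && dp < dst_max) {
        if (nbits == 0) { flags = src[sp++]; nbits = 8; continue; }
        if (flags & 1) {
            /* match: 16 bits = 12-bit ring offset + 4-bit length */
            if (sp + 2 > n) break;
            unsigned ofs = src[sp] | ((src[sp + 1] & 0xF0u) << 4);
            unsigned len = (src[sp + 1] & 0x0Fu) + LZSS_MINLEN;
            sp += 2;
            for (unsigned i = 0; i < len && dp < dst_max; i++) {
                uint8_t b = ring[(ofs + i) % LZSS_RING];
                dst[dp++] = b;
                ring[rp] = b; rp = (rp + 1) % LZSS_RING;
            }
        } else {
            ring[rp] = dst[dp++] = src[sp++];
            rp = (rp + 1) % LZSS_RING;
        }
        flags >>= 1; nbits--;
    }
    return dp;
}
```

    The modulo ring copy is the part that "kinda sucks": every copied
    byte pays a wrap check, where LZ4-style decoders just copy from
    earlier in the flat output buffer.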


    Then again, I guess maybe Deflate could have been used for IO
    acceleration if reading from a floppy, but the needed memory overheads
    would likely have been painful for real-mode MS-DOS.

    Then again, most 1980s era filesystems did not use native compression
    (and then things like DoubleSpace/DriveSpace were seen as slower than
    just using an uncompressed FAT filesystem).


    Well, and on modern Windows, another big time cost (and a merit of using
    a VFS) is the comparably high cost of opening and closing files.

    Well, also printing debug messages into a console is annoyingly slow (if
    one has a big "printf storm" this can itself become a bottleneck, as it
    seems the Windows console has a speed of somewhere around 200
    lines/second or something).


    So, dunno...


    ...


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Feb 8 19:37:31 2026
    From Newsgroup: comp.arch


    MitchAlsup <user5857@newsgrouper.org.invalid> posted:


    kegs@provalid.com (Kent Dickey) posted:

    In article <1770238707-5857@newsgrouper.org>,
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    kegs@provalid.com (Kent Dickey) posted:

    An advantage of hashed page tables is the TLB itself has no
    "intermediate" entries. This lets the primary TLB hold more useful
    translations. This is not much of an issue if the primary TLB is
    >= 256 entries, but it's really noticeable if the TLB is <= 32
    entries. You need to lean in to hashed page tables' advantages.

    Question: If one has a 32-E to 64-E Fully Associative L1 TLB backed
    up by a 6-cycle 1024-E 4-way L2 TLB--does the L1+L2 constitute its
    "primary" TLB (or not, with explanation)?!?

    I think the 32-entry fully associative is the primary TLB. The 1024
    entry is a secondary TLB. Hits in the primary TLB cause no stalls at
    all. Misses in the primary probably lead to stalls (but it is
    possible the 6 cycle latency can be hidden sometimes), so it costs
    performance.

    If the primary TLB (the one with no stalls) is only 4 entries, then
    we can easily calculate a pretty substantial stall penalty for most
    programs, since it will pay the 6-cycle stall a lot (and that's for
    an L2 TLB hit). So we make the L1 TLB as big as possible, but it
    cannot be made too large or it's too slow, so it ends up between 32
    and 128 entries (256 tops).

    I'm just pointing out the intermediate hierarchical entries (the page
    table entries which are just page table pointers) need to be cached
    in the TLB to make future lookups faster.

    Cached yes, in TLB no--generally cached in TableWalk Accelerator store.

    And putting them there reduces
    the size of the TLB for "useful" entries.

    Which is why you don't put 'em there.

    You can try to avoid this penalty by only putting page table pointer
    entries in the L2 TLB, but then you're just making primary TLB misses
    slower, especially since the L2 TLB lookup cannot scan all entries
    like the primary can--you have to probe each one separately.

    If you want to translate the address 0x1234_56789000 using 4KB
    pages, you can do a SINGLE primary associative TLB lookup, and look
    for:

    1) Page entry for 0x1234_56789000
    2) Page-table pointer for 0x1234_566xxxxx
    3) Page-table pointer for 0x1234_4xxxxxxx
    4) Page-table pointer for 0x120x_xxxxxxxx

    (you may need to do a second lookup to get page-table entries, since
    it's expensive to deal with multiple hits in one clock).
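    The four tags probed by that single lookup fall out of simple
    shifts; a sketch assuming a 4-level table with 4 KB pages and 9
    translation bits per level (matching the example addresses above;
    not any specific core's logic):

```c
#include <stdint.h>

/* Compute the four tags probed in one associative lookup: the leaf
 * page entry plus the page-table pointer at each level above it. */
static void tlb_probe_tags(uint64_t va, uint64_t tags[4]) {
    tags[0] = va >> 12;   /* leaf page entry                        */
    tags[1] = va >> 21;   /* PTP covering 2 MB  (0x1234_566xxxxx)   */
    tags[2] = va >> 30;   /* PTP covering 1 GB  (0x1234_4xxxxxxx)   */
    tags[3] = va >> 39;   /* PTP covering 512 GB (0x120x_xxxxxxxx)  */
}
```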

    Another reason to put them somewhere other than in the TLB. One can
    make all 4 accesses simultaneously, and start the tablewalk as far
    down the tree as possible.

    For example, Ross HyperSPARC used 6 flip-flops as a TWA. When a TLB was
    being accessed, the 6 flip flops would compare the VA, so when a TLB miss
    took place, we already knew what layer of the page tables to start. We
    used 4 flip-flops for Data TLB PTPs and 2 for code PTPs. This very tiny
    TWA removed just over 1/2 of the cycles of table walking, overall.
    Each flip-flop knew which layer in the TW it covered.

    A modern TWA might contain a cache line from "all" the top layers and
    4 cache lines from the bottom layer--and one could easily multiply
    that by 2x to capture nested paging, and still the TWA is smaller
    than a single 64-E TLB while getting rid of 70% of table walk
    accesses to L2.

    With the page-table-pointer for 0x1234_566xxxxx, you then do a single
    data cache lookup to get the page entry (say 3 more clocks). So
    putting these page-table pointers in the L1 TLB has a big benefit to
    the page walk times.

    A delicate balance, especially if L2 is less than 10 cycles.
    ------------------------
    Accounting for the overhead, it seems like almost any process will waste
    at least 7 L1 TLB entries to these intermediate page table pointer
    entries, so your 32-entry L1 TLB will generally hold no more than 25
    useful pages.

    So, you don't put 'em in TLB you put 'em in TW-Accelerator Store.

    Kent
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Feb 8 16:57:15 2026
    From Newsgroup: comp.arch

    On 2/5/2026 11:23 PM, Kent Dickey wrote:
    In article <1770238707-5857@newsgrouper.org>,
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    kegs@provalid.com (Kent Dickey) posted:

    An advantage of hashed page tables is the TLB itself has no
    "intermediate" entries. This lets the primary TLB hold more useful
    translations. This is not much of an issue if the primary TLB is
    >= 256 entries, but it's really noticeable if the TLB is <= 32
    entries. You need to lean in to hashed page tables' advantages.

    Question: If one has a 32-E to 64E Fully Associative L1 TLB backed up by
    a 6-cycle 1024-E 4-way L2 TLB--does the L1+L2 constitute its "primary"
    TLB (or not with explanation) ?!?

    I think the 32-entry fully associative is the primary TLB. The 1024 entry
    is a secondary TLB. Hits in the primary TLB cause no stalls at all.
    Misses in the primary probably lead to stalls (but it is possible the
    6 cycle latency can be hidden sometimes), so it costs performance.

    If the primary TLB (the one with no stalls) is only 4 entries, then
    we can easily calculate a pretty substantial stall penalty for most
    programs since it will pay the 6 cycle stall a lot (and that's for a L2
    TLB hit). So we make the L1 TLB as big as possible, but it cannot be
    made too large or it's too slow, so it ends up between 32 and 128
    entries (256 tops).


    Full associativity isn't worth it.

    The gains from more associativity seem to max out at around 8-way,
    and 32 pages isn't nearly enough coverage (very high TLB miss rates
    would be unavoidable).

    Better IMO to do a larger set associative TLB, maybe 8-way if one can
    afford it.

    At 256x4 (1024 total TLBEs), miss rate is acceptably low.

    At 1024x8 (8192 total TLBEs), miss rate would essentially drop to 0 in
    my test programs. Major cost though is the cost of the SRAM/BRAM.


    For the L1 TLBs, had noted that simple direct mapping can work well.
    Except for certain adverse scenarios, direct mapping seems like a
    likely strategy.
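    For reference, a set-associative TLB lookup of the kind discussed
    comes down to an index-then-compare; a sketch of the "256x4" case
    above (4 KB pages and the entry layout are my assumptions here):

```c
#include <stdint.h>

/* 256-set x 4-way TLB: the low VPN bits pick the set, then the 4 ways
 * in that set are compared in parallel (sequentially in software). */
#define TLB_SETS 256
#define TLB_WAYS 4

typedef struct { uint64_t vpn; uint64_t pte; int valid; } TlbEnt;
static TlbEnt tlb[TLB_SETS][TLB_WAYS];

static int tlb_lookup(uint64_t va, uint64_t *pte_out) {
    uint64_t vpn = va >> 12;                      /* 4 KB pages */
    unsigned set = (unsigned)(vpn & (TLB_SETS - 1));
    for (int w = 0; w < TLB_WAYS; w++) {
        if (tlb[set][w].valid && tlb[set][w].vpn == vpn) {
            *pte_out = tlb[set][w].pte;
            return 1;   /* hit */
        }
    }
    return 0;           /* miss: fall back to table walk */
}
```

    A direct-mapped L1 TLB is the TLB_WAYS == 1 case: no way loop, just
    one compare, which is what makes it attractive for timing.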

    ...


    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Mon Feb 9 11:56:28 2026
    From Newsgroup: comp.arch

    The article requires signing in with an IEEE membership. I used to
    have a student membership many years ago. I let it expire as I was
    not using it; this was before the internet was big. Every so often I
    contemplate getting one, but I am trying to keep costs down.

    It used to be easy to find articles in authors' web pages, but it's
    gotten harder and harder over the years.
    You may want to try LibGen or Anna's Archive instead, which do seem
    to have it.


    === Stefan
    --- Synchronet 3.21d-Linux NewsLink 1.2