• Inverted Page Tables

    From Robert Finch@robfi680@gmail.com to comp.arch on Thu Jan 29 10:33:29 2026
    From Newsgroup: comp.arch

    Have inverted page tables fallen out of favor? And how much OS support
    is there for them?

    I am thinking about using an inverted page table instead of a
    hierarchical page table for the current project. The advantage is that
    the entire page table should be able to fit into BRAM on the FPGA. It
    should then be possible to eliminate the TLB from the design at perhaps
    a small performance cost. It would reduce the LUT cost of the MMU.

    Translations could work directly from the page table.

    The hash function would select a page table group. Averaging about 1.2
    lookups per memory access. Using the ASID XOR'd with address bits 20 to
    29 as the hash. 32k entries in the table.

    Using 64kB pages to keep the page table small. 1GB RAM to manage.
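    The index computation described above might be sketched as follows (the
    ASID width and the exact fold into a 15-bit index are assumptions, not
    part of the original description):

```c
#include <stdint.h>

#define IPT_ENTRIES 32768u              /* 32k entries in the table  */
#define IPT_MASK    (IPT_ENTRIES - 1u)  /* 15-bit index              */

/* Hash = ASID XOR'd with VA bits 20..29, masked to the table size.
   How the 10 VA bits and the ASID fold into 15 index bits is assumed. */
static inline uint32_t ipt_hash(uint32_t asid, uint64_t va)
{
    uint32_t va_bits = (uint32_t)((va >> 20) & 0x3FFu);  /* bits 20..29 */
    return (asid ^ va_bits) & IPT_MASK;
}
```

    With 64kB pages and 1GB to manage there are only 16k physical pages, so
    32k entries gives the 2x headroom mentioned later in the thread.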
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Jan 29 18:04:44 2026
    From Newsgroup: comp.arch

    On Thu, 29 Jan 2026 10:33:29 -0500
    Robert Finch <robfi680@gmail.com> wrote:

    Have inverted page tables fallen out of favor?

    It depends on whom you ask.
    If you ask LBT then inverted page tables never had an opportunity to
    fall out of his favor, because he has hated them since the very first
    encounter.


  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Thu Jan 29 13:09:37 2026
    From Newsgroup: comp.arch

    I am thinking about using an inverted page table instead of a hierarchical page table for the current project. The advantage is that the entire page table should be able to fit into BRAM on the FPGA. It should then be possible to eliminate the TLB from the design at perhaps a small performance cost. It would reduce the LUT cost of the MMU.

    If you have the freedom (on the OS side) to use a virtually-indexed
    cache, I'd recommend you try and use a system like:

    PA physical_address_of_virtual_address (VA x)
    {
        if (x == 0)
            return 0;
        else {
            VA pagetable_entry = (x / PAGESIZE * sizeof (VA));
            return *(PA *)pagetable_entry;
        }
    }

    This uses the CPU cache as the TLB and does pagetable walks "from the
    bottom", so its efficiency is basically not affected by the number of
    levels of pagetables.


    === Stefan
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Jan 29 21:40:27 2026
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    Have inverted page tables fallen out of favor?

    Not so much fallen out of favor as: "they were never really IN Favor".

    And how much OS support
    is there for them?

    Probably just enough to get by if you copy an existing IPT.

    I am thinking about using an inverted page table instead of a
    hierarchical page table for the current project. The advantage is that
    the entire page table should be able to fit into BRAM on the FPGA. It
    should then be possible to eliminate the TLB from the design at perhaps
    a small performance cost. It would reduce the LUT cost of the MMU.

    An alternative is for HW to generate a hash address based on faulting
    VA and make one probe into a 1-16MB TLB stored across the memory hierarchy.
    If the access matches VA the rest of the data is your PTE. If not, then
    you trap to SW to refill the memory-TLB. Rationale: if the hashed address
    is not that bad, and the hash table is large, TLB refills end up at low
    cost (delay).
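    That single-probe scheme might be modeled like this (names, table size,
    and the hash itself are placeholders, not Mitch's design):

```c
#include <stdbool.h>
#include <stdint.h>

#define MTLB_ENTRIES (1u << 17)   /* illustrative; a few MB of slots */

typedef struct {
    bool     valid;
    uint64_t vpn;   /* VA tag that owns this slot */
    uint64_t pte;   /* translation + permissions  */
} mtlb_slot;

static mtlb_slot mtlb[MTLB_ENTRIES];

static inline uint32_t mtlb_hash(uint64_t vpn)
{
    return (uint32_t)(vpn ^ (vpn >> 17)) & (MTLB_ENTRIES - 1u);
}

/* One hashed probe into the memory-resident TLB. */
static bool mtlb_probe(uint64_t vpn, uint64_t *pte_out)
{
    uint32_t h = mtlb_hash(vpn);
    if (mtlb[h].valid && mtlb[h].vpn == vpn) {  /* VA matches: use PTE */
        *pte_out = mtlb[h].pte;
        return true;
    }
    return false;   /* mismatch: trap to SW to refill this slot */
}

/* What the SW refill handler would do after walking its own tables. */
static void mtlb_refill(uint64_t vpn, uint64_t pte)
{
    mtlb[mtlb_hash(vpn)] = (mtlb_slot){ .valid = true, .vpn = vpn, .pte = pte };
}
```

    The point of the scheme is that the common case is the single probe;
    the SW trap only runs on a tag mismatch.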

    Translations could work directly from the page table.

    The hash function would select a page table group. Averaging about 1.2 lookups per memory access. Using the ASID XOR'd with address bits 20 to
    29 as the hash. 32k entries in the table.

    Using 64kB pages to keep the page table small. 1GB RAM to manage.

    But realistically, a HW table-walker with a bit of internal caching supplemented with a large L2-TLB works fine in practice.
  • From BGB@cr88192@gmail.com to comp.arch on Thu Jan 29 17:36:24 2026
    From Newsgroup: comp.arch

    On 1/29/2026 9:33 AM, Robert Finch wrote:
    Have inverted page tables fallen out of favor? And how much OS support
    is there for them?

    I am thinking about using an inverted page table instead of a
    hierarchical page table for the current project. The advantage is that
    the entire page table should be able to fit into BRAM on the FPGA. It
    should then be possible to eliminate the TLB from the design at perhaps
    a small performance cost. It would reduce the LUT cost of the MMU.

    Translations could work directly from the page table.

    The hash function would select a page table group. Averaging about 1.2 lookups per memory access. Using the ASID XOR'd with address bits 20 to
    29 as the hash. 32k entries in the table.

    Using 64kB pages to keep the page table small. 1GB RAM to manage.


    Inverted page tables kinda have the disadvantage, though, of combining
    *both* the disadvantages of a software-managed TLB with some of the disadvantages of a HW page walker.

    The main argued benefits they have (over plain SW TLB):
    Typically larger (because RAM backed);
    Potentially slightly faster on process switches.
    Less of a "TLB miss storm".

    Common drawbacks with SW TLB:
    Still need to deal with TLB misses (if page isn't in IPT);
    Finite associativity means can't fit whole VAS in an IPT.

    Common drawbacks with HW page walking:
    Still need to access RAM, potentially multiple times, to resolve a miss;
    Loss of flexibility: As with HW page walking, one can't transparently
    deal with multi-level address translation (without, essentially, falling
    back to just treating it like a SW TLB; or having whatever code manages
    the bottom-level IPT also being aware of the use of nested address translation).

    Though, maybe not as bad as with HW page walking, where the page-walker
    also has to be aware of and natively deal with nested translation
    (potentially a very big headache for a HW page walker).



    Big set-associative IPT held in BRAM (vs in Main RAM):
    Basically the same as how I am already doing SW TLB.

    While it could be argued that an IPT can reduce process-switch cost by
    having per-process IPT, with some cleverness this can also be faked on
    top of SW TLB (by modeling the TLB per-process, and then proactively
    injecting TLBE's on process switch).




    One other intermediate idea I had considered was a "Last Level Page
    Table", where the MMU is given the address of the last-level of each page-table as its own specialized TLBE (after any requisite address translation), which can potentially have some of the advantages of both
    SW TLB and HW page walking:
    Retains flexibility and can deal with nested translation;
    Avoids many of the harder parts of a page walker;
    Only needs a single memory access to resolve a request.
    Could have some of the benefits of a HW page-walker:
    Most of the would-be TLB misses can be handled by hardware.
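    The "Last Level Page Table" idea above might be sketched like so (all
    names and sizes are hypothetical; the real design would hold the table
    bases as specialized TLBEs inside the MMU rather than in a SW array):

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT  14    /* 16kB pages, illustrative              */
#define LVL_BITS    10    /* entries per last-level page table     */
#define LLPT_SLOTS  64    /* small cache of last-level table bases */

typedef struct {
    bool      valid;
    uint64_t  region;   /* VA bits above the last-level index           */
    uint64_t *table;    /* already-translated base of the final level   */
} llpt_slot;

static llpt_slot llpt[LLPT_SLOTS];

static void llpt_install(uint64_t region, uint64_t *table)
{
    llpt[region % LLPT_SLOTS] = (llpt_slot){ true, region, table };
}

/* On a hit, resolving a request takes a single memory access:
   one load from the cached last-level table. */
static bool llpt_translate(uint64_t va, uint64_t *pte_out)
{
    uint64_t idx    = (va >> PAGE_SHIFT) & ((1u << LVL_BITS) - 1u);
    uint64_t region = va >> (PAGE_SHIFT + LVL_BITS);
    llpt_slot *s = &llpt[region % LLPT_SLOTS];
    if (s->valid && s->region == region) {
        *pte_out = s->table[idx];   /* the single memory access */
        return true;
    }
    return false;   /* miss: SW installs the (translated) table base */
}
```

    Because SW supplies the already-translated table base, nested
    translation stays a SW problem, which is where the flexibility
    claim above comes from.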

    Though, admittedly, haven't implemented it as of yet, but mostly because
    in my case TLB miss activity tends to be low enough that it isn't a big
    enough part of the total clock cycle budget (or, IOW, 256x4W already has
    a pretty good hit rate in a world of mostly smallish processes).

    Would have the issue that it can no longer detect success or failure
    within a fixed latency, so would need more complexity to deal with an "in-flight potential miss", either needing a mechanism to signal the
    request to bounce around the ring some more, or a way to temporarily
    take requests off the bus and inject them back later.


    Though, the most likely option being that untranslated VAS's are not
    allowed to leave the L1 ringbus (so would be deflected back into the
    ring until the TLB can deal with them). Does lead to the potential issue
    of bus congestion leading to a scenario where the page-access response
    (from the L2 cache or RAM) can't easily enter back into the L1 ring if
    most of the spots are already full (with in-flight cycling requests).

    Though, then again, for latency:
    L1 D$: 2c
    L1 I$: 2c
    TLB: 3c
    BridgeVcA: 3c
    This is a 64x4W write-through cache that also functions as a bridge.

    So, L1A has ~ 10c ring latency.
    Assuming both L1's fire off 2 requests, this is around a 40% congestion,
    and theoretically should not break down until around 70% (unless the
    latency values have an unlucky fraction, *).

    Goes and checks:
    Needed logic was already in place, though as-is it would print debug
    messages whenever the request is bounced back inside the L1 ring (an untranslated L1 request escaping the L1 ring would be "extra bad", as
    only physically addressed requests are allowed on the L2 ring).

    *: If the L2 ring has a latency that is an integer multiple of the L1
    ring latency, then potentially the same messages can collide endlessly,
    and the bus would effectively be stalled. In theory, could be addressed
    via a mechanism to either fine-adjust or jitter L2 ring latency if needed.


  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch on Fri Jan 30 11:16:43 2026
    From Newsgroup: comp.arch

    In article <10lfuka$1d2c4$1@dont-email.me>,
    Robert Finch <robfi680@gmail.com> wrote:
    Have inverted page tables fallen out of favor? And how much OS support
    is there for them?

    IVTs are a terrible idea. Consider the challenge of using them
    to share memory between virtual address spaces in a
    multiprocessing system, efficiently.

    - Dan C.

  • From Robert Finch@robfi680@gmail.com to comp.arch on Fri Jan 30 16:03:39 2026
    From Newsgroup: comp.arch

    On 2026-01-30 3:35 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-01-30 6:16 a.m., Dan Cross wrote:
    In article <10lfuka$1d2c4$1@dont-email.me>,
    Robert Finch <robfi680@gmail.com> wrote:
    Have inverted page tables fallen out of favor? And how much OS support
    is there for them?

    IVTs are a terrible idea. Consider the challenge of using them
    to share memory between virtual address spaces in a
    multiprocessing system, efficiently.

    - Dan C.

    I may have used misleading terminology; the IVT I refer to is a
    hash-table-based one. I tend to think of them as the same thing. I do
    not think anybody would use a plain IVT.

    <response to earlier comments>
    Is the entire VAS covered by the hierarchical page table system?

    With the entire PAS covered by the page table in BRAM it can be walked
    in hardware very quickly, one cycle per step as opposed to walking the
    page table in DRAM which could be quite slow.

    Process switch is handled by including an ASID in the mapping as for a TLB.
    For the IVT implementation the table is twice the size needed to cover
    the PAS to allow for shared memory pages.


    The table just stores mappings VPN -> PPN so I am not quite following
    the challenge of using them for shared memory? Multiple VPNs -> the
    same PPN are possible. Is it the freeing up of physical pages in SW that
    causes an issue?

    Consider mmap(file) in multiple different processes at different VAs.
    So, now one has multiple VA pages pointing at one Physical page.

    ??? I think a hash table has this characteristic. Multiple VA pages can
    point to a single physical page using a hash table. Is it just a
    performance issue? ASID is part of the hash/compare.

    I guess I should take a look at the mmap() code.

    The hash table is only slightly different than a giant TLB. The TLB is
    walked on a miss instead of walking main memory. Page faults would be
    handled the same way.

    The table is clustered: the hash points to eight entries which are then searched in parallel for the correct one. If not found, a table walk
    begins using quadratic open addressing.
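    A SW model of that clustered lookup might read as follows (the hash
    function and entry layout are placeholders; the probe limit matches the
    30-step / 240-slot figure given later in the thread):

```c
#include <stdbool.h>
#include <stdint.h>

#define CLUSTER  8       /* entries compared in parallel in HW          */
#define NGROUPS  4096u   /* 4096 * 8 = 32k entries                      */
#define MAXSTEPS 30      /* give up (page fault) after 30 probe steps   */

typedef struct {
    bool     valid;
    uint16_t asid;
    uint64_t vpn;
    uint64_t ppn;
} ipt_entry;

static ipt_entry ipt[NGROUPS][CLUSTER];

/* The hash selects a group; the 8 entries of the cluster are searched
   (in HW, in parallel).  On a group miss, quadratic open addressing
   picks the next group to probe. */
static bool ipt_lookup(uint16_t asid, uint64_t vpn, uint64_t *ppn_out)
{
    uint32_t g0 = (uint32_t)(vpn ^ asid) % NGROUPS;
    for (uint32_t step = 0; step < MAXSTEPS; step++) {
        ipt_entry *grp = ipt[(g0 + step * step) % NGROUPS];
        for (int i = 0; i < CLUSTER; i++)
            if (grp[i].valid && grp[i].asid == asid && grp[i].vpn == vpn) {
                *ppn_out = grp[i].ppn;
                return true;
            }
    }
    return false;   /* 240 slots probed without a match: page fault */
}

/* Insertion follows the same probe sequence, taking the first free slot. */
static bool ipt_insert(uint16_t asid, uint64_t vpn, uint64_t ppn)
{
    uint32_t g0 = (uint32_t)(vpn ^ asid) % NGROUPS;
    for (uint32_t step = 0; step < MAXSTEPS; step++) {
        ipt_entry *grp = ipt[(g0 + step * step) % NGROUPS];
        for (int i = 0; i < CLUSTER; i++)
            if (!grp[i].valid) {
                grp[i] = (ipt_entry){ true, asid, vpn, ppn };
                return true;
            }
    }
    return false;
}
```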



    I seem to recall at least one fellow advocating the limited use of
    shared memory, using replication instead of shared libraries for instance.
    A hierarchical page table is a better solution, but I was looking for
    something lower cost. My monster 2-level TLB is currently about 9000
    LUTs (I have been working to reduce this). The IVT is about 900 LUTs.


  • From BGB@cr88192@gmail.com to comp.arch on Fri Jan 30 15:44:45 2026
    From Newsgroup: comp.arch

    On 1/30/2026 3:03 PM, Robert Finch wrote:
    On 2026-01-30 3:35 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-01-30 6:16 a.m., Dan Cross wrote:
    In article <10lfuka$1d2c4$1@dont-email.me>,
    Robert Finch <robfi680@gmail.com> wrote:
    Have inverted page tables fallen out of favor? And how much OS support
    is there for them?

    IVTs are a terrible idea. Consider the challenge of using them
    to share memory between virtual address spaces in a
    multiprocessing system, efficiently.

    - Dan C.

    I may have used misleading terminology, the IVT I refer to is a hash
    table based one. I tend to think of them as the same thing. I do not
    think anybody would use a plain IVT.

    <response to earlier comments>
    Is the entire VAS covered by the hierarchical page table system?

    With the entire PAS covered by the page table in BRAM it can be walked
    in hardware very quickly, one cycle per step as opposed to walking the
    page table in DRAM which could be quite slow.

    Process switch is handled by including an ASID in the mapping as for
    a TLB.

    For the IVT implementation the table is twice the size needed to cover
    the PAS to allow for shared memory pages.


    The table just stores mappings VPN -> PPN so I am not quite following
    the challenge to using them for shared memory? Multiple VPN-> to the
    same PPN are possible. Is it the freeing up of physical pages in SW that
    cause an issue?

    Consider mmap(file) in multiple different processes at different VAs.
    So, now one has multiple VA pages pointing at one Physical page.

    ??? I think a hash table has this characteristic. Multiple VA pages can point to a single physical page using a hash table. Is it just a performance issue? ASID is part of the hash/compare.

    I guess I should take a look at the mmap() code.

    The hash table is only slightly different than a giant TLB. The TLB is walked on a miss instead of walking main memory. Page faults would be handled the same way.

    The table is clustered, the hash points to eight entries which are then searched in parallel for the correct one. If not found a table walk
    begins using quadratic open addressing.


    It isn't obvious (at least to me) how what you are describing is much different from a SW managed TLB.


    Main difference I think was that I ended up mostly settling on modulo addressing rather than hashing, but the reasoning here is more subtle:
    while on average hashing is better than modulo addressing, the worst
    case behavior for hashing is worse than the worst case for modulo.

    Effectively, hashing requires twice the associativity to not shoot
    itself in the foot in the worst case, whereas modulo is merely slightly
    worse in the average case.

    Ironically, also means that while small-scale alignment is often
    preferable, medium and large scale alignment is best minimized (or, IOW,
    ASLR can actually be beneficial for performance by reducing unrelated
    data stomping the same locations in the caches and TLB).


    Though, there are intermediate options, say, for a TLB, XOR'ing the
    index with the ASID for non-global pages, *1; but then this means
    needing to partition the TLB between global and local (halving
    associativity relative to LUT budget), in turn still favoring the use of modulo addressing.


    Granted, a hashing scheme and multiple probing steps could be possible
    to increase effective associativity at the cost of latency (say, in an
    extreme case, using latency tricks to cause a 2-way TLB to pretend to be
    4 or 8 way).

    Though, with something like my ringbus design this would require needing
    to get clever to allow it to work (as it could not translate requests on multiple adjacent slots). Would likely need, say, a small 1-way TLB on
    the front-end, followed by 2 or 3 additional slots to resolve the lookup sequentially if the 1-way TLB misses (where multiple subsequent requests
    that miss on the 1-way TLB and were shadowed by a prior request taking
    up the "multi-way probe" resources, would need to cycle back around the
    ring for another pass through the TLB, *2).


    *1: Or, semi-global in a MMU that lacks true global pages, as the
    existence of true global pages are actively detrimental in some
    scenarios (and it is preferable instead to divide ASIDs into groups, and
    a global page is only global within a group or within related groups;
    allowing for ASIDs to exist which will not see these "global" pages).

    *2: And, while the ringbus continues to seem high-latency and stupid,
    still seems to be one of the better options to balance performance and cheapness (simpler bus designs being painfully slow, and more advanced FIFO/queue based buses being far more complex and expensive...).




    I seem to recall at least one fellow advocating the limited use of
    shared memory, using replication instead of shared libraries for
    instance.

    A hierarchical page table is a better solution, but I was looking for
    something lower cost. My monster 2-level TLB is currently about 9000
    LUTs (I have been working to reduce this). The IVT is about 900 LUTs.



  • From Robert Finch@robfi680@gmail.com to comp.arch on Fri Jan 30 19:23:56 2026
    From Newsgroup: comp.arch

    On 2026-01-30 4:44 p.m., BGB wrote:
    On 1/30/2026 3:03 PM, Robert Finch wrote:
    On 2026-01-30 3:35 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-01-30 6:16 a.m., Dan Cross wrote:
    In article <10lfuka$1d2c4$1@dont-email.me>,
    Robert Finch <robfi680@gmail.com> wrote:
    Have inverted page tables fallen out of favor? And how much OS
    support is there for them?

    IVTs are a terrible idea. Consider the challenge of using them
    to share memory between virtual address spaces in a
    multiprocessing system, efficiently.

    - Dan C.

    I may have used misleading terminology, the IVT I refer to is a hash
    table based one. I tend to think of them as the same thing. I do not
    think anybody would use a plain IVT.

    <response to earlier comments>
    Is the entire VAS covered by the hierarchical page table system?

    With the entire PAS covered by the page table in BRAM it can be walked
    in hardware very quickly, one cycle per step as opposed to walking the
    page table in DRAM which could be quite slow.

    Process switch is handled by including an ASID in the mapping as for
    a TLB.

    For the IVT implementation the table is twice the size needed to cover
    the PAS to allow for shared memory pages.


    The table just stores mappings VPN -> PPN so I am not quite following
    the challenge to using them for shared memory? Multiple VPN-> to the
    same PPN are possible. Is it the freeing up of physical pages in SW
    that
    cause an issue?

    Consider mmap(file) in multiple different processes at different VAs.
    So, now one has multiple VA pages pointing at one Physical page.

    ??? I think a hash table has this characteristic. Multiple VA pages
    can point to a single physical page using a hash table. Is it just a
    performance issue? ASID is part of the hash/compare.

    I guess I should take a look at the mmap() code.

    The hash table is only slightly different than a giant TLB. The TLB is
    walked on a miss instead of walking main memory. Page faults would be
    handled the same way.

    The table is clustered, the hash points to eight entries which are
    then searched in parallel for the correct one. If not found a table
    walk begins using quadratic open addressing.


    It isn't obvious (at least to me) how what you are describing is much different from a SW managed TLB.

    It is not a lot different. On a translation miss the table is walked
    with HW though instead of SW walking main memory. There are enough
    entries in the table to cover the physical address range. Only page
    faults are handled in SW which should be rare. A page fault is generated
    after 30 steps in the walk though. That is a miss on 240 possible
    locations. 240-way associativity is not enough?
    The page fault handler will likely just report 'out-of-memory' then quit
    the app.


    Main difference I think was that I ended up mostly settling on modulo addressing rather than hashing, but the reasoning here is more subtle:
    while on average hashing is better than modulo addressing, the worst
    case behavior for hashing is worse than the worst case for modulo.

    Effectively, hashing requires twice the associativity to not shoot
    itself in the foot in the worst case, whereas modulo is merely slightly worse in the average case.

    I think clustering gives the table more associativity. Up to eight VAs
    can all hash to the same cluster. Almost as if it were eight-way
    associative.

    I think clustering is key to using an inverted hash-table. I got this
    from studying the PowerPC.


    Ironically, also means that while small-scale alignment is often
    preferable, medium and large scale alignment is best minimized (or, IOW, ASLR can actually be beneficial for performance by reducing unrelated
    data stomping the same locations in the caches and TLB).


    Though, there are intermediate options, say, for a TLB, XOR'ing the
    index with the ASID for non-global pages, *1; but then this means
    needing to partition the TLB between global and local (halving
    associativity relative to LUT budget), in turn still favoring the use of modulo addressing.


    Granted, a hashing scheme and multiple probing steps could be possible
    to increase effective associativity at the cost of latency (say, in an extreme case, using latency tricks to cause a 2-way TLB to pretend to be
    4 or 8 way).

    Not too worried about latency as long as it is reasonable. Every step is another eight ways associative due to clustering.


    Though, with something like my ringbus design this would require needing
    to get clever to allow it to work (as it could not translate requests on multiple adjacent slots). Would likely need, say, a small 1-way TLB on
    the front-end, followed by 2 or 3 additional slots to resolve the lookup sequentially if the 1-way TLB misses (where multiple subsequent requests that miss on the 1-way TLB and were shadowed by a prior request taking
    up the "multi-way probe" resources, would need to cycle back around the
    ring for another pass through the TLB, *2).


    *1: Or, semi-global in a MMU that lacks true global pages, as the
    existence of true global pages are actively detrimental in some
    scenarios (and it is preferable instead to divide ASIDs into groups, and
    a global page is only global within a group or within related groups; allowing for ASIDs to exist which will not see these "global" pages).

    *2: And, while the ringbus continues to seem high-latency and stupid,
    still seems to be one of the better options to balance performance and cheapness (simpler bus designs being painfully slow, and more advanced FIFO/queue based buses being far more complex and expensive...).





    I seem to recall at least one fellow advocating the limited use of
    shared memory, using replication instead of shared libraries for
    instance.

    A hierarchical page table is a better solution, but I was looking for
    something lower cost. My monster 2-level TLB is currently about 9000
    LUTs (I have been working to reduce this). The IVT is about 900 LUTs.




    Using a huge page size (256kB) now so the table can be kept small. There
    are three hash tables, one for instructions, two for data. Same as
    triple porting the table. There are still thousands of pages available.





  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Jan 31 01:36:59 2026
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2026-01-30 4:44 p.m., BGB wrote:
    On 1/30/2026 3:03 PM, Robert Finch wrote:
    On 2026-01-30 3:35 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-01-30 6:16 a.m., Dan Cross wrote:
    In article <10lfuka$1d2c4$1@dont-email.me>,
    Robert Finch <robfi680@gmail.com> wrote:
    Have inverted page tables fallen out of favor? And how much OS
    support is there for them?

    IVTs are a terrible idea. Consider the challenge of using them
    to share memory between virtual address spaces in a
    multiprocessing system, efficiently.

    - Dan C.

    I may have used misleading terminology, the IVT I refer to is a hash
    table based one. I tend to think of them as the same thing. I do not
    think anybody would use a plain IVT.

    <response to earlier comments>
    Is the entire VAS covered by the hierarchical page table system?

    With the entire PAS covered by the page table in BRAM it can be walked
    in hardware very quickly, one cycle per step as opposed to walking the
    page table in DRAM which could be quite slow.

    Process switch is handled by including an ASID in the mapping as for
    a TLB.

    For the IVT implementation the table is twice the size needed to cover
    the PAS to allow for shared memory pages.


    The table just stores mappings VPN -> PPN so I am not quite following
    the challenge to using them for shared memory? Multiple VPN-> to the
    same PPN are possible. Is it the freeing up of physical pages in SW
    that
    cause an issue?

    Consider mmap(file) in multiple different processes at different VAs.
    So, now one has multiple VA pages pointing at one Physical page.

    ??? I think a hash table has this characteristic. Multiple VA pages
    can point to a single physical page using a hash table. Is it just a
    performance issue? ASID is part of the hash/compare.

    I guess I should take a look at the mmap() code.

    The hash table is only slightly different than a giant TLB. The TLB is
    walked on a miss instead of walking main memory. Page faults would be
    handled the same way.

    The table is clustered, the hash points to eight entries which are
    then searched in parallel for the correct one. If not found a table
    walk begins using quadratic open addressing.


    It isn't obvious (at least to me) how what you are describing is much different from a SW managed TLB.

    It is not a lot different. On a translation miss the table is walked
    with HW though instead of SW walking main memory. There are enough
    entries in the table to cover the physical address range. Only page
    faults are handled in SW which should be rare. A page fault is generated after 30 steps in the walk though. That is a miss on 240 possible
    locations. 240-way associativity is not enough?
    The page fault handler will likely just report 'out-of-memory' then quit
    the app.


    Main difference I think was that I ended up mostly settling on modulo addressing rather than hashing, but the reasoning here is more subtle: while on average hashing is better than modulo addressing, the worst
    case behavior for hashing is worse than the worst case for modulo.

    Effectively, hashing requires twice the associativity to not shoot
    itself in the foot in the worst case, whereas modulo is merely slightly worse in the average case.

    I think clustering gives the table more associativity. Up to eight VAs
    can all hash to the same cluster. Almost as if it were eight-way associative.

    I think clustering is key to using an inverted hash-table. I got this
    from studying the PowerPC.


    Ironically, also means that while small-scale alignment is often preferable, medium and large scale alignment is best minimized (or, IOW, ASLR can actually be beneficial for performance by reducing unrelated
    data stomping the same locations in the caches and TLB).


    Though, there are intermediate options, say, for a TLB, XOR'ing the
    index with the ASID for non-global pages, *1; but then this means
    needing to partition the TLB between global and local (halving associativity relative to LUT budget), in turn still favoring the use of modulo addressing.


    Granted, a hashing scheme and multiple probing steps could be possible
    to increase effective associativity at the cost of latency (say, in an extreme case, using latency tricks to cause a 2-way TLB to pretend to be
    4 or 8 way).

    Not too worried about latency as long as it is reasonable. Every step is another eight ways associative due to clustering.

    Be careful !!

    Matrix300 (SPEC89) uses all 4 transposes of DGEMM. The 4th transpose takes
    a TLB miss every 4th Load; or once every 2 cycles on the 6-wide machine.
    A 32-entry fully associative TLB would miss every other cycle, while a 256-entry Direct Mapped TLB would never take a miss.
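    The effect can be reproduced with a toy model (illustrative only, and
    simplified to one page per element): a cyclic walk over 256 pages, as a
    large-stride column traversal produces, thrashes a 32-entry fully
    associative LRU TLB, while a 256-entry direct-mapped TLB indexed by the
    low VPN bits keeps every page in its own set after the cold misses.

```c
/* 256-set direct-mapped TLB: low VPN bits select the set. */
static int dm_misses(int rounds)
{
    long set[256];
    int miss = 0;
    for (int i = 0; i < 256; i++) set[i] = -1;
    for (int r = 0; r < rounds; r++)
        for (long vpn = 0; vpn < 256; vpn++)
            if (set[vpn & 255] != vpn) { miss++; set[vpn & 255] = vpn; }
    return miss;
}

/* 32-entry fully associative TLB with LRU replacement. */
static int fa_misses(int rounds)
{
    long ent[32]; int age[32];
    int miss = 0, t = 0;
    for (int i = 0; i < 32; i++) { ent[i] = -1; age[i] = 0; }
    for (int r = 0; r < rounds; r++)
        for (long vpn = 0; vpn < 256; vpn++) {
            int hit = -1, lru = 0;
            for (int i = 0; i < 32; i++) {
                if (ent[i] == vpn) hit = i;
                if (age[i] < age[lru]) lru = i;
            }
            if (hit >= 0) age[hit] = ++t;                    /* hit  */
            else { miss++; ent[lru] = vpn; age[lru] = ++t; } /* miss */
        }
    return miss;
}
```

    Over four passes the direct-mapped model takes only the 256 cold
    misses, while the LRU model misses on every single access: the working
    set of 256 pages cycled in order defeats any LRU structure smaller
    than the cycle.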

  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sat Jan 31 12:31:36 2026
    From Newsgroup: comp.arch

    Robert Finch wrote:
    On 2026-01-30 3:35 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-01-30 6:16 a.m., Dan Cross wrote:
    In article <10lfuka$1d2c4$1@dont-email.me>,
    Robert Finch <robfi680@gmail.com> wrote:
    Have inverted page tables fallen out of favor? And how much OS support
    is there for them?

    IVTs are a terrible idea. Consider the challenge of using them
    to share memory between virtual address spaces in a
    multiprocessing system, efficiently.

    - Dan C.

    I may have used misleading terminology, the IVT I refer to is a hash
    table based one. I tend to think of them as the same thing. I do not
    think anybody would use a plain IVT.

    <response to earlier comments>
    Is the entire VAS covered by the hierarchical page table system?

    With the entire PAS covered by the page table in BRAM it can be walked
    in hardware very quickly, one cycle per step as opposed to walking the
    page table in DRAM which could be quite slow.

    Process switch is handled by including an ASID in the mapping as for
    a TLB.

    For the IVT implementation the table is twice the size needed to cover
    the PAS to allow for shared memory pages.


    The table just stores mappings VPN -> PPN so I am not quite following
    the challenge to using them for shared memory? Multiple VPN-> to the
    same PPN are possible. Is it the freeing up of physical pages in SW that
    cause an issue?

    Consider mmap(file) in multiple different processes at different VAs.
    So, now one has multiple VA pages pointing at one Physical page.

    ??? I think a hash table has this characteristic. Multiple VA pages can point to a single physical page using a hash table. Is it just a performance issue? ASID is part of the hash/compare.

    Be careful about ASID's - they aren't just part of the virtual address
    in MMU's like x86. There is a logic aspect in the address compare.
    If the PTE Global flag is set then ASID is NOT included in the VA compare,
    if the PTE Global flag is clear then ASID is included in the VA compare.

    If the G flag is set on OS kernel pages then the effect is to make the
    kernel virtual pages one single address space on all cores with no
    associated ASID, while process virtual spaces reside in their own ASID's.

    The above is easy to implement in a CAM because the Global flag is
    part of the PTE entry. Compare the virtual page number (VPN) and if
    there is a PTE match, then if the PTE.G flag is set take ASID as matched, otherwise check if ASID is matched.

    But if you just tack the ASID onto the VA to get a larger VA and
    use that as an index then you don't get the same logic because
    the PTE.G flag that tells you whether ASID matters for that PTE,
    is on the PTE that you are looking up.
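    The CAM compare described above can be sketched in C (a hypothetical
    entry layout; the field names are illustrative, not from any particular
    MMU):

```c
#include <stdint.h>

/* Hypothetical TLB/PTE entry; field names are illustrative only. */
typedef struct {
    uint64_t vpn;     /* virtual page number tag                     */
    uint16_t asid;    /* address-space ID this entry belongs to      */
    int      global;  /* PTE.G: ASID is ignored in the compare if set */
} tlb_entry;

/* CAM-style compare: the VPN must always match; the ASID only
   matters when the entry's Global flag is clear. */
static int tlb_match(const tlb_entry *e, uint64_t vpn, uint16_t asid)
{
    if (e->vpn != vpn)
        return 0;
    return e->global || e->asid == asid;
}
```

    The point EricP makes is that the G flag lives *inside* the entry, so
    this test can only be done after the entry is found, which is why
    folding the ASID into the index loses the Global semantics.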

    I guess I should take a look at the mmap() code.

    The hash table is only slightly different than a giant TLB. The TLB is walked on a miss instead of walking main memory. Page faults would be handled the same way.

    Yes.
    Long ago I read a paper about a port of Linux to a PowerPC with its IVT.
    Linux memory management is based on a hierarchical page table (HPT) so
    rather than rewrite the memory manager they treated the IVT as a large TLB.
    As pages were touched they copied the PTE's from the HPT into IVT.
    When there was a process switch they purged the IVT of all entries
    the same way one might purge a TLB. Later they added ASID's so they
    didn't have to purge IVT on address space switch.

    The code that copied the PTE's from HPT to IVT was very similar
    to a SW managed TLB, as VM Virtual Machines also do if they don't
    have nested page tables.

    Also you can't guarantee that your whole program will fit into a single
    inverted/hash table of a predetermined size because you don't know in
    advance which entries will collide and push other entries out.
    Even if it does fit today, any change to the program could break that.


    The table is clustered, the hash points to eight entries which are then searched in parallel for the correct one. If not found a table walk
    begins using quadratic open addressing.
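    A software sketch of that clustered lookup, under assumed parameters
    (the cluster count and hash below are placeholders, not the actual
    design; only the 8-way clusters and the probe-then-fault shape follow
    the description):

```c
#include <stdint.h>
#include <stddef.h>

#define NCLUSTERS 4096   /* illustrative; the real table has 32K entries */
#define WAYS      8      /* one cluster = 8 entries searched in parallel */
#define MAX_PROBE 30     /* fault after 30 clusters = 240 entries        */

typedef struct {
    uint16_t asid;
    uint64_t vpn;
    uint64_t ppn;
    int      valid;
} pte;

static pte table[NCLUSTERS][WAYS];

/* Placeholder hash; the actual design XORs the ASID with VA bits. */
static unsigned ivt_hash(uint16_t asid, uint64_t vpn)
{
    return (unsigned)((vpn ^ asid) % NCLUSTERS);
}

/* Quadratic open addressing over clusters: h, h+1, h+4, h+9, ... */
static pte *ivt_lookup(uint16_t asid, uint64_t vpn)
{
    unsigned h = ivt_hash(asid, vpn);
    for (unsigned i = 0; i < MAX_PROBE; i++) {
        pte *cl = table[(h + i * i) % NCLUSTERS];
        for (int w = 0; w < WAYS; w++)
            if (cl[w].valid && cl[w].asid == asid && cl[w].vpn == vpn)
                return &cl[w];
    }
    return NULL;   /* miss after 240 locations -> page fault */
}

/* Collisions never displace entries; they bounce to a later cluster. */
static int ivt_insert(uint16_t asid, uint64_t vpn, uint64_t ppn)
{
    unsigned h = ivt_hash(asid, vpn);
    for (unsigned i = 0; i < MAX_PROBE; i++) {
        pte *cl = table[(h + i * i) % NCLUSTERS];
        for (int w = 0; w < WAYS; w++)
            if (!cl[w].valid) {
                cl[w] = (pte){ asid, vpn, ppn, 1 };
                return 1;
            }
    }
    return 0;      /* no free slot within the probe limit: out-of-memory */
}
```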



    I seem to recall at least one fellow advocating the limited use of
    shared memory, using replication instead of shared libraries for
    instance.

    A hierarchical page table is a better solution, but I was looking for
    something lower cost. My monster 2-level TLB is currently about 9000
    LUTs (I have been working to reduce this). The IVT is about 900 LUTs.

    What is your MMU currently doing that uses 9000 LUTs?

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sat Jan 31 16:29:07 2026
    From Newsgroup: comp.arch

    EricP wrote:
    Robert Finch wrote:
    ??? I think a hash table has this characteristic. Multiple VA pages
    can point to a single physical page using a hash table. Is it just a
    performance issue? ASID is part of the hash/compare.

    Be careful about ASID's - they aren't just part of the virtual address
    in MMU's like x86. There is a logic aspect in the address compare.
    If the PTE Global flag is set then ASID is NOT included in the VA compare,
    if the PTE Global flag is clear then ASID is included in the VA compare.

    If the G flag is set on OS kernel pages then the effect is to make the
    kernel virtual pages one single address space on all cores with no
    associated ASID, while process virtual spaces reside in their own ASID's.

    The above is easy to implement in a CAM because the Global flag is
    part of the PTE entry. Compare the virtual page number (VPN) and if
    there is a PTE match, then if the PTE.G flag is set take ASID as matched, otherwise check if ASID is matched.

    But if you just tack the ASID onto the VA to get a larger VA and
    use that as an index then you don't get the same logic because
    the PTE.G flag that tells you whether ASID matters for that PTE,
    is on the PTE that you are looking up.

    One could achieve the same effect as the CAM approach by picking off the
    range of addresses assigned to the OS and assigning them to ASID = 0,
    and having the process range use the current non-zero ASID register.

    If we have a 64-bit VA with 4kB pages = 52-bit virtual page number
    and we added a 12-bit ASID then that gives a 76-bit virtual space
    with a 64-bit AS-VPN to look up. We reserve ASID=0 for the OS,
    1..4095 are for processes.

    The MMU has a register MMU_OS_START that holds the OS unsigned lower
    start address and a comparator. If the VA is >= MMU_OS_START then
    it uses an ASID = 0, otherwise it comes from the MMU_ASID register.

    This 12-bit value is concatenated with the 52-bit VPN to give a
    64-bit combined ASID-VPN index value to lookup in the TLB.

    The OS software that assigns ASIDs also skips assigning 0 to any process.
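    A C sketch of that selection logic (register names follow the
    description above; the OS base address and current ASID are just
    example values):

```c
#include <stdint.h>

/* Example register contents; real values are set by the OS. */
static uint64_t MMU_OS_START = 0xFFFF800000000000ull; /* OS region base */
static uint16_t MMU_ASID     = 42;                    /* current process */

/* VA >= MMU_OS_START selects the shared OS space (ASID 0);
   otherwise the per-process ASID register is used. */
static uint16_t effective_asid(uint64_t va)
{
    return (va >= MMU_OS_START) ? 0 : MMU_ASID;
}

/* Concatenate the 12-bit ASID with the 52-bit VPN (4 kB pages)
   to form the 64-bit combined ASID-VPN lookup index. */
static uint64_t asid_vpn_index(uint64_t va)
{
    uint64_t vpn = va >> 12;
    return ((uint64_t)effective_asid(va) << 52) | vpn;
}
```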



    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Jan 31 17:03:54 2026
    From Newsgroup: comp.arch

    On 2026-01-31 12:31 p.m., EricP wrote:
    Robert Finch wrote:
    On 2026-01-30 3:35 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-01-30 6:16 a.m., Dan Cross wrote:
    In article <10lfuka$1d2c4$1@dont-email.me>,
    Robert Finch <robfi680@gmail.com> wrote:
    Have inverted page tables fallen out of favor? And how much OS support
    is there for them?

    IVTs are a terrible idea. Consider the challenge of using them
    to share memory between virtual address spaces in a
    multiprocessing system, efficiently.

    - Dan C.

    I may have used misleading terminology, the IVT I refer to is a hash
    table based one. I tend to think of them as the same thing. I do not
    think anybody would use a plain IVT.

    <response to earlier comments>
    Is the entire VAS covered by the hierarchical page table system?

    With the entire PAS covered by the page table in BRAM it can be walked
    in hardware very quickly, one cycle per step as opposed to walking the
    page table in DRAM which could be quite slow.

    Process switch is handled by including an ASID in the mapping as for
    a TLB.

    For the IVT implementation the table is twice the size needed to cover
    the PAS to allow for shared memory pages.


    The table just stores mappings VPN -> PPN so I am not quite following
    the challenge to using them for shared memory? Multiple VPNs mapping to
    the same PPN are possible. Is it the freeing up of physical pages in SW
    that causes an issue?

    Consider mmap(file) in multiple different processes at different VAs.
    So, now one has multiple VA pages pointing at one Physical page.

    ??? I think a hash table has this characteristic. Multiple VA pages
    can point to a single physical page using a hash table. Is it just a
    performance issue? ASID is part of the hash/compare.

    Be careful about ASID's - they aren't just part of the virtual address
    in MMU's like x86. There is a logic aspect in the address compare.
    If the PTE Global flag is set then ASID is NOT included in the VA compare,
    if the PTE Global flag is clear then ASID is included in the VA compare.

    If the G flag is set on OS kernel pages then the effect is to make the
    kernel virtual pages one single address space on all cores with no
    associated ASID, while process virtual spaces reside in their own ASID's.

    A global bit is not being used. The address space is divided in two by
    the MSB. The kernel space has the MSB set and is unmapped. Previously I
    used an ASID of zero to represent the kernel address space. The ASID is
    not being used as an index.

    The above is easy to implement in a CAM because the Global flag is
    part of the PTE entry. Compare the virtual page number (VPN) and if
    there is a PTE match, then if the PTE.G flag is set take ASID as matched, otherwise check if ASID is matched.

    But if you just tack the ASID onto the VA to get a larger VA and
    use that as an index then you don't get the same logic because
    the PTE.G flag that tells you whether ASID matters for that PTE,
    is on the PTE that you are looking up.

    I guess I should take a look at the mmap() code.

    The hash table is only slightly different than a giant TLB. The TLB is
    walked on a miss instead of walking main memory. Page faults would be
    handled the same way.

    Yes.
    Long ago I read a paper about a port of Linux to a PowerPC with its IVT.
    Linux memory management is based on a hierarchical page table (HPT) so
    rather than rewrite the memory manager they treated the IVT as a large TLB.
    As pages were touched they copied the PTE's from the HPT into IVT.
    When there was a process switch they purged the IVT of all entries
    the same way one might purge a TLB. Later they added ASID's so they
    didn't have to purge IVT on address space switch.

    The code that copied the PTE's from HPT to IVT was very similar
    to a SW managed TLB, as VM Virtual Machines also do if they don't
    have nested page tables.

    I am not sure about getting Linux working. I may just use my own OS.
    Skimming through the mmap() code has given me an idea of just how
    complex the software is.

    Also you can't guarantee that your whole program will fit into a single
    inverted/hash table of a predetermined size because you don't know in
    advance which entries will collide and push other entries out.
    Even if it does fit today, any change to the program could break that.

    Collisions will not push other entries out, they just bounce to a
    different empty entry. If one cannot be found then it is an out-of-memory.

    Testing some. Testing just allocates pages without freeing them so it is
    a bit unrealistic.
    The IVT stopped after 63 bounces at 69% full. (A max of 64 bounces is
    set). Average number of bounces: 1.8 per lookup. This was with random
    ASIDs and addresses.
    The table can be clocked at twice the CPU rate.

    I have been wondering how to collect and use stats to control things.

    The entries in the IVT for ASIDs no longer in use could be cleared automatically by a background HW process.


    The table is clustered, the hash points to eight entries which are
    then searched in parallel for the correct one. If not found a table
    walk begins using quadratic open addressing.



    I seem to recall at least one fellow advocating the limited use of
    shared memory, using replication instead of shared libraries for
    instance.

    A hierarchical page table is a better solution, but I was looking for
    something lower cost. My monster 2-level TLB is currently about 9000
    LUTs (I have been working to reduce this). The IVT is about 900 LUTs.

    What is your MMU currently doing that uses 9000 LUTs?

    I tried to over build it originally. It currently has a triple ported L2
    TLB so that L2 TLB entries can be used directly to form addresses on a
    miss. This trims at least one clock cycle off the miss time for L1. The
    L1 TLB is fully associative and triple ported as well. L2 is also split
    into two TLBs one for regular pages and a second TLB for huge pages.
    There is a triple-ported miss queue. The TLB hit logic is a bit scary.
    Then there is the logic to update the accessed and modified flags of the
    TLB entry.

    I have been working on the TLB to make L2 single ported and the miss
    queue single ported as well. That should reduce both the LUT cost and
    BRAM cost. It will reduce performance a little bit if there are
    concurrent misses happening.


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From kegs@kegs@provalid.com (Kent Dickey) to comp.arch on Sun Feb 1 15:28:47 2026
    From Newsgroup: comp.arch

    In article <10lji2u$2haq2$1@dont-email.me>,
    Robert Finch <robfi680@gmail.com> wrote:
    On 2026-01-30 4:44 p.m., BGB wrote:
    On 1/30/2026 3:03 PM, Robert Finch wrote:

    [ Proposal is a hashed page table, with 2x the entries to cover the entire physical address space, with each hash entry having 8 pages which need to
    be searched before moving to the next entry].

    It isn't obvious (at least to me) how what you are describing is much
    different from a SW managed TLB.

    It is not a lot different. On a translation miss the table is walked
    with HW though instead of SW walking main memory. There are enough
    entries in the table to cover the physical address range. Only page
    faults are handled in SW which should be rare. A page fault is generated
    after 30 steps in the walk though. That is a miss on 240 possible
    locations. 240-way associativity is not enough?
    The page fault handler will likely just report 'out-of-memory' then quit
    the app.

    I'm going to guess your physical address space is just 32-bits, otherwise
    this table is huge. I think you said you have 64KB pages (but then you said 256KB pages--which is just too big, so I'll ignore that), which for a
    44-bit physical address space would mean 2^28 entries, with each entry
    being 128-bits, for a 4GB page table. But for 32-bit, it's just 2^16
    entries of 64-bits each.

    A page table which doesn't scale is problematic. Especially if it needs
    OS support. Who wants to invest effort in a dead-end idea?

    Your hashed page table has issues which I think you should address:

    1) Supporting multiple page sizes is a pain. If you have 2MB pages and
    64KB pages, there are 32 64KB pages for each 2MB page. So do you
    lazily load the hashtable only on misses, or load all 32 aliases
    at once?

    2) It's common to just do a single hash lookup, then trap to software. Doing
    probing is a big pain. The problems come from removing entries.
    If you page out a page, it leaves gaps in the table. When you do
    a hardware lookup, and find no valid pages at the first entry, this
    doesn't tell you anything! The obvious solution is ALL hardware
    lookups must do the maximum searching before declaring a hash-table
    miss--which you say is 30 steps, checking 240 entries. I'm not sure
    there's a better solution than that, but this is a pretty big
    penalty for a TLB miss.

    3) It's very complex to efficiently update the hash table. On inserting
    a new entry, where do you put it? On one hand, a "move-to-front"
    strategy seems like a good idea--newly added pages seem like they
    should get lookup priority. But how do you do this? It looks like
    each software handled miss has a TON of work to do. It seems like
    you need to re-hash each entry you consider moving, to make sure it
    stays on its proper hash chain. And keeping the OS page structures
    in sync looks like even more work.

    4) The global nature of the hash table means multiple threads probably need
    a lock to change it. So misses to software have to get a lock before
    making any hashtable changes. This is very expensive. This is
    especially bad when a process ends--all of its translations need to
    be removed, and so all page table misses are blocked while that
    happens. This is bad.

    A simpler approach is a per-process hash table of variable size in memory, sized to at least 2X the number of pages being used. It's indexed by a hash
    of ASID+VA, and does a single probe. It might be useful to do two hash
    tables: one for large pages, and one for small pages (with 2 lookups).
    On any miss, the hash entry to be updated is easy to find for software.
    But: there's effectively no associativity--but just increase the hash table's size if software detects that you are getting conflicts. Pick a good
    hash function to avoid stupid conflicts. I recommend you make the hash function an instruction, so software can calculate it easily (so if you
    reverse bits and do other stuff that may be hard for software to calculate,
    you make it easy).
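    As a sketch, a single-probe index computation along those lines might
    look like this in C (the mixer and the ASID/VPN packing are
    illustrative choices on my part, not a proposal for the actual hash
    instruction Kent suggests):

```c
#include <stdint.h>

/* Illustrative 64-bit finalizer-style mixer (MurmurHash3 fmix64
   constants); invertible, so distinct inputs always mix to distinct
   outputs. */
static uint64_t mix(uint64_t x)
{
    x ^= x >> 33; x *= 0xff51afd7ed558ccdull;
    x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ull;
    x ^= x >> 33;
    return x;
}

/* Single-probe index into a per-process hash table: hash of ASID+VA,
   masked down assuming nslots is a power of two sized to at least 2x
   the resident pages. If this same function were exposed as an
   instruction, the software miss handler could compute the slot
   trivially. */
static uint64_t hpt_index(uint16_t asid, uint64_t va,
                          unsigned page_shift, uint64_t nslots)
{
    uint64_t vpn = va >> page_shift;
    return mix(((uint64_t)asid << 52) | vpn) & (nslots - 1);
}
```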

    This gives you the benefits of a hardware table walker with less trouble.

    An advantage of hashed page tables is the TLB itself has no "intermediate" entries. This lets the primary TLB hold more useful translations. This
    is not much of an issue if the primary TLB is >= 256 entries, but it's
    really noticeable if the TLB is <= 32 entries. You need to lean in to
    hashed page tables' advantages.

    Kent
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sun Feb 1 12:49:53 2026
    From Newsgroup: comp.arch

    Robert Finch wrote:
    On 2026-01-31 12:31 p.m., EricP wrote:
    Robert Finch wrote:

    The hash table is only slightly different than a giant TLB. The TLB
    is walked on a miss instead of walking main memory. Page faults would
    be handled the same way.

    Yes.
    Long ago I read a paper about a port of Linux to a PowerPC with its IVT.
    Linux memory management is based on a hierarchical page table (HPT) so
    rather that rewrite the memory manager they treated the IVT as a large
    TLB.
    As pages were touched they copied the PTE's from the HPT into IVT.
    When there was a process switch they purged the IVT of all entries
    the same way one might purge a TLB. Later they added ASID's so they
    didn't have to purge IVT on address space switch.

    The code that copied the PTE's from HPT to IVT was very similar
    to a SW managed TLB, as VM Virtual Machines also do if they don't
    have nested page tables.

    I am not sure about getting Linux working. I may just use my own OS.
    Skimming through the mmap() code has given me an idea of just how
    complex the software is.

    It was just an example of how one can use a page table to emulate
    any other kind of page table.

    Also you can't guarantee that your whole program will fit into a single
    inverted/hash table of a predetermined size because you don't know in
    advance which entries will collide and push other entries out.
    Even if it does fit today, any change to the program could break that.

    Collisions will not push other entries out, they just bounce to a
    different empty entry. If one cannot be found then it is an out-of-memory.

    Testing some. Testing just allocates pages without freeing them so it is
    a bit unrealistic.
    The IVT stopped after 63 bounces at 69% full. (A max of 64 bounces is
    set). Average number of bounces: 1.8 per lookup. This was with random
    ASIDs and addresses.
    The table can be clocked at twice the CPU rate.

    One might make the hash function controllable so you can optimize the
    bounce length for different programs. Also the max bounce limit should
    be a dynamic HW register and not hard wired.

    I have been wondering how to collect and use stats to control things.

    The entries in the IVT for ASIDs no longer in use could be cleared automatically by a background HW process.

    That implies some kind of ASID allocator to track which are in-use.
    It also implies a whole IVT scan as each ASID is freed,
    or maintain a pending free list and a scavenger process.

    Simpler to have process creation assign a new ASID by incrementing
    an unsigned up counter that starts at 1. When you hit its max then reset
    to 1 and clear the whole IVT for all non-zero ASID entries in one pass.
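    That counter scheme might look like the following sketch (the purge
    hook is a stub standing in for the hardware sweep of the IVT; 12-bit
    ASIDs with 0 reserved for the OS, as discussed earlier):

```c
#include <stdint.h>

#define ASID_MAX 4095        /* 12-bit ASIDs; 0 reserved for the OS */

static uint16_t next_asid = 1;
static int      purges;     /* counts wrap-around sweeps (for the stub) */

/* Stub for the hardware/OS pass that clears every non-zero-ASID
   entry in the IVT. */
static void ivt_purge_nonzero(void) { purges++; }

/* Hand out ASIDs from an up-counter starting at 1; on hitting the
   max, purge all process translations once and restart at 1. */
static uint16_t alloc_asid(void)
{
    if (next_asid > ASID_MAX) {
        ivt_purge_nonzero();
        next_asid = 1;
    }
    return next_asid++;
}
```

    This trades an occasional full-table sweep for not having to track
    per-ASID liveness or run a scavenger.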


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Feb 1 16:12:34 2026
    From Newsgroup: comp.arch

    On 1/30/2026 7:36 PM, MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-01-30 4:44 p.m., BGB wrote:
    On 1/30/2026 3:03 PM, Robert Finch wrote:
    On 2026-01-30 3:35 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-01-30 6:16 a.m., Dan Cross wrote:
    In article <10lfuka$1d2c4$1@dont-email.me>,
    Robert Finch <robfi680@gmail.com> wrote:
    Have inverted page tables fallen out of favor? And how much OS support
    is there for them?

    IVTs are a terrible idea. Consider the challenge of using them
    to share memory between virtual address spaces in a
    multiprocessing system, efficiently.

    - Dan C.

    I may have used misleading terminology, the IVT I refer to is a hash
    table based one. I tend to think of them as the same thing. I do not
    think anybody would use a plain IVT.

    <response to earlier comments>
    Is the entire VAS covered by the hierarchical page table system?

    With the entire PAS covered by the page table in BRAM it can be walked
    in hardware very quickly, one cycle per step as opposed to walking the
    page table in DRAM which could be quite slow.

    Process switch is handled by including an ASID in the mapping as for
    a TLB.

    For the IVT implementation the table is twice the size needed to cover
    the PAS to allow for shared memory pages.

    The table just stores mappings VPN -> PPN so I am not quite following
    the challenge to using them for shared memory? Multiple VPNs mapping to
    the same PPN are possible. Is it the freeing up of physical pages in SW
    that causes an issue?

    Consider mmap(file) in multiple different processes at different VAs.
    So, now one has multiple VA pages pointing at one Physical page.

    ??? I think a hash table has this characteristic. Multiple VA pages
    can point to a single physical page using a hash table. Is it just a
    performance issue? ASID is part of the hash/compare.

    I guess I should take a look at the mmap() code.

    The hash table is only slightly different than a giant TLB. The TLB is
    walked on a miss instead of walking main memory. Page faults would be
    handled the same way.

    The table is clustered, the hash points to eight entries which are
    then searched in parallel for the correct one. If not found a table
    walk begins using quadratic open addressing.


    It isn't obvious (at least to me) how what you are describing is much
    different from a SW managed TLB.

    It is not a lot different. On a translation miss the table is walked
    with HW though instead of SW walking main memory. There are enough
    entries in the table to cover the physical address range. Only page
    faults are handled in SW which should be rare. A page fault is generated
    after 30 steps in the walk though. That is a miss on 240 possible
    locations. 240-way associativity is not enough?
    The page fault handler will likely just report 'out-of-memory' then quit
    the app.


    Errm, now I am feeling uncertain how viable implementing this would be...


    Sometimes, it ends up being better to implement the minimum complexity mechanism that can work, and only implement a more complex mechanism if
    there is a significant drawback or inefficiency with the simpler
    mechanism that the more complex mechanism could address.


    Like, I can note my use of a custom RP2 compression scheme:
    It is a byte oriented LZ variant:
    Typically slightly better than LZ4 for general data;
    Typically slightly worse than LZ4 for program binaries.


    I have one use-case where I use RP2:
    For compressing voxel chunks in my BT3 3D engine.

    Each chunk is 16*16*16 blocks, with one of several in-RAM storage schemes:
    Single block type (no index);
    2-16 block types, 4-bit index, 2K/chunk
    17-256 block types, 8 bit index, 4K/chunk
    257-4096: no index, 32K / chunk.

    So, for storage, it flattens out the chunk:
    Simple header;
    List of block types;
    Index data.

    Then RP2 compresses the chunk.

    The previous engine (BT2) had used a LZ+STF+AdRice scheme here, but this
    is slower than a byte-oriented scheme (like RP2 or LZ4), and the
    compression delta is small. So, BT3 originally went with RP2 mostly as
    it was faster.

    Technically, either LZ4 or RP2 could be used, but RP2 notably beats LZ4
    here, as the smallest run encoding ends up used heavily:
    dddddddd-dlllrrr0:
    0-7 literal bytes, 3-10 matched bytes, distance=0..511
    A distance of 0 is invalid, can be used as an escape code.
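    One plausible reading of that two-byte token, decoded in C (the exact
    bit order is my interpretation of the "dddddddd-dlllrrr0" notation read
    MSB-to-LSB, not RP2's actual wire format):

```c
#include <stdint.h>

/* Token type check: the low tag bit of the smallest run encoding is 0. */
static int rp2_is_short(uint16_t t)  { return (t & 1) == 0; }

/* rrr: raw 0..7, encoding 3..10 matched bytes. */
static unsigned rp2_nmatch(uint16_t t) { return ((t >> 1) & 7) + 3; }

/* lll: 0..7 literal bytes preceding the match. */
static unsigned rp2_nlit(uint16_t t)   { return (t >> 4) & 7; }

/* ddddddddd: 9 bits, distance 0..511; 0 is invalid and usable as an
   escape code, per the description above. */
static unsigned rp2_dist(uint16_t t)   { return t >> 7; }
```

    Usage: a token assembled as `(5 << 7) | (2 << 4) | (1 << 1)` would
    decode to distance 5, 2 literals, and a 4-byte match under this
    reading.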

    The next match size up is 3 bytes: 4-67 match bytes, vs an 8K distance.
    This seems to be the main difference with program binaries, which have
    more of a skew towards short, long-distance matches, which favor LZ4.

    RP2 though has a stronger advantage when the input data is under 8K.
    Though, it is possible that a 35 byte match over 16K might have been
    better on average (possibly with another match type for "long match at
    short distance", or RLE like patterns).



    I recently evaluated a possible trick of bolting on STF+AdRice after the
    fact (on top of RP2) as a sort of post-compressor entropy stage (jank,
    but limits computational complexity on the encoder side).

    Had then noted a pattern (based on the RP2-compressed size):
    Under 512 bytes: Entropy stage typically makes it bigger.
    This was the dominant size class here.
    1-2K: Entropy helps slightly, but fails to cross a "useful threshold".
    4K+: Mostly unseen.

    Then went and grabbed a bitwise range coder, and tried using this as a
    test, along with something resembling a "mini-LZMA" (LZ + bitwise range
    coder, along similar lines to LZMA).

    Result, none crossed the threshold...
    The threshold: Give an additional 20% size reduction over the use of RP2.

    Might at least get used sometimes if the threshold were 5% or 10%, but
    don't really want to spend the comparably steep cost of an entropy coder
    for a merely 5% or 10% gain.

    Why not Huffman (or Deflate?):
    Because, ironically, for small highly compressible data, Deflate
    actually turns up being worse off here than RP2.

    Here, one wants the entropy-coded compressors to be more, "break glass
    in case of compression emergency".



    Though, that said, bolting an LZMA style range coder onto RP2 would
    still appear to be a moderately strong possibility (in terms of
    compression), even if bitwise range coding is almost impractically slow
    (on a desktop PC, one is hard pressed to push more than around 50 MB/sec through a range coder).


    Though, that said, there is sometimes something in my mind that seems to
    be trying to yell out that the LZMA range coder design holds a clue for greatly increasing the efficiency and learning rate of neural nets (but
    the rest of my mind can't figure out how to combine them; and both
    genetic algorithms and back-propagation leave something to be desired...).

    If anything, one could try to go the other way and use a neural-net as a predictive model for a range-coder, but then one would merely be burning
    CPU time.

    Though, did note a phenomenon with backprop neural nets where they were
    using weakly-biased neurons as a sort of memory between training steps
    (in the case of repeating test cases, it would "memorize" the answer and
    then use it to cheat on the following run when given the same inputs).
    Which is in some sense analogous to the normal behavioral pattern in a range-coder.

    Or, basically, the training process was working like "flash cards" and
    when it would get an answer wrong, the training algo would note it and
    then present the same inputs again a few passes later. The NN was then
    effectively remembering the inputs and the output adjustment and repeating
    them again (without actually learning anything), and the net seemingly
    got pretty good at this (not learning the actual task, just sort of
    optimizing itself to be able to cheat at the NN training algorithm's use
    of repetition on wrong answers).


    Where, in my NN experiments, it would appear that near-0 behavior for
    weights may be more relevant than previously expected (and handling them
    in a similar way to DEC/PDP style floating-point may not be ideal). Even
    if it mostly reflects a use-case that is an adverse edge case in NN
    training (but, in a dynamically adapted net, could potentially be a useful/desirable property; working as a sort of working memory; well,
    along with feeding part of the activation outputs back into the inputs,
    etc).

    Well, also in range-coder predictors, the relative adjustment strength
    depends on which direction one is moving and where one is in the space.
    If near 0, then movements away from 0 are rapid, but as one gets
    further away, it gets harder to form an increment (at the upper
    extremes, it becoming impossible to move further). The closer one is
    towards 0, the harder it is to approach 0. Though this is likely an
    artifact of the distortion of what linear adjustments look like within a non-linear space.
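    For concreteness, the LZMA-style predictor update being described can
    be written as (11-bit probability of a 0 bit, 1/32 adaptation step, as
    in LZMA's bit model):

```c
#include <stdint.h>

#define PROB_ONE  (1u << 11)   /* 2048: probability scale           */
#define MOVE_BITS 5            /* step = 1/32 of the remaining range */

/* Adaptive estimate of P(bit == 0). The step toward an extreme
   shrinks as the estimate approaches it: near 2048 a 0-bit barely
   moves it, and near 0 a 1-bit step can round to zero, i.e. the
   asymmetry described above. */
static uint16_t prob_update(uint16_t p, int bit)
{
    if (bit == 0)
        return p + ((PROB_ONE - p) >> MOVE_BITS);
    else
        return p - (p >> MOVE_BITS);
}
```

    Note how at p = 31 a 1-bit moves the estimate by 31 >> 5 = 0: the
    closer to the boundary, the harder further movement becomes, which is
    the non-linear behavior the paragraph above is pointing at.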

    Comparably, the curve of FP values is steeper than that of RC predictors (exponential rather than square).

    Well, some of this could imply using some lookup-tables for weight
    adjustment in the backprop stage (similar to those used for an
    range-coder). Might also need to consider using IEEE style denormals for
    FP8.

    Still feels as if I am missing something.


    Well, there is still the issue that I needed to run the NN training
    using FP16 weights internally, as directly training the FP8 weights
    wasn't stable enough. So, part of the issue is figuring out stable live adaptation on FP8 weights (which does kinda resemble the range-coder scenario). Well, and that (post backprop) the weight-update in this case
    is mostly XOR'ing the sign-bits to figure out which direction to adjust
    the weight (likewise, also resembling the RC update scenario), though
    skipping weight adjustment when the error is 0 (and skipping downward adjustment when the weight is already 0, etc).


    Issue though seems to be that while FP8 works well enough for forward inference, the direct jumps between FP8 values are too large for
    backprop to work with effectively, and it essentially ends up needing
    some sub-ULP bits for finer adjustment (leading to the issue of using
    FP16 for the backprop stage; which would limit practicality for a live adaptation net).

    Also, would need to figure out a good way to replace "correct/incorrect" training with something more like "pain/no-pain" adaptation, but with
    "pain driven adaptation" one is left merely to do backprop using the
    inverse of the previous outputs as the error vector, which is far less effective (and not clearly much better than just using genetic
    algorithms as the training process; except that GAs limit scenarios a
    lot more than the use of positive and negative reinforcement as a
    training algorithm).

    But, the downside of positive/negative reinforcement is getting
    something much beyond simplistic insect-like behaviors.

    ...



    Main difference I think was that I ended up mostly settling on modulo
    addressing rather than hashing, but the reasoning here is more subtle:
    while on average hashing is better than modulo addressing, the worst
    case behavior for hashing is worse than the worst case for modulo.

    Effectively, hashing requires twice the associativity to not shoot
    itself in the foot in the worst case, whereas modulo is merely slightly
    worse in the average case.

    I think clustering gives the table more associativity. Up to eight VAs
    can all hash to the same cluster. Almost as if it were eight-way
    associative.

    I think clustering is key to using an inverted hash-table. I got this
    from studying the PowerPC.


    Ironically, also means that while small-scale alignment is often
    preferable, medium and large scale alignment is best minimized (or, IOW,
    ASLR can actually be beneficial for performance by reducing unrelated
    data stomping the same locations in the caches and TLB).


    Though, there are intermediate options, say, for a TLB, XOR'ing the
    index with the ASID for non-global pages, *1; but then this means
    needing to partition the TLB between global and local (halving
associativity relative to LUT budget), in turn still favoring the use of modulo addressing.


    Granted, a hashing scheme and multiple probing steps could be possible
    to increase effective associativity at the cost of latency (say, in an
extreme case, using latency tricks to cause a 2-way TLB to pretend to be 4 or 8 way).

Not too worried about latency as long as it is reasonable. Every step adds
another eight ways of associativity due to clustering.

    Be careful !!

    Matrix300 (SPEC89) uses all 4 transposes of DGEMM. The 4th transpose takes
    a TLB miss every 4th Load; or once every 2-cycles on the 6-wide machine.
    A 32-entry fully associative TLB would miss every other cycle, while a 256-entry Direct Mapped TLB would never take a miss.


    In my own testing, it seemed for the main (L2) TLB, assuming modulo
    indexing and SW refill:
    1-way: doesn't work for L2 TLB (works well for an L1 TLB though);
    2-way: bare minimum, but can fail in certain edge cases (*1);
    4-way: Works pretty well;
    8-way: "near perfect" on the associativity front.

    In many contexts, there doesn't seem to be much gain in going over 8-way associativity, and only a modest gain vs 4-way.

    *1: There is a risk for cases where both D$ and I$ miss and both cross
    page boundaries in the access. While this is less of an issue with
    modulo addressing than with hashing (if this happens with hashed
    indexing and 2-way associativity, one is SOL). It seems to squeeze
    through, if barely, with modulo indexing (because no two adjacent pages
    can stomp the same index).
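A quick check of that footnote's claim (sizes here are arbitrary, and the hash is just an example): with power-of-two modulo indexing two adjacent pages can never land on the same set, whereas a hashed index offers no such guarantee.

```python
SETS = 64
PAGE_SHIFT = 16

def modulo_index(va):
    return (va >> PAGE_SHIFT) % SETS

def hashed_index(va):
    vpn = va >> PAGE_SHIFT
    return (vpn ^ (vpn >> 6)) % SETS   # arbitrary example hash

page_a = 0x100000
page_b = page_a + (1 << PAGE_SHIFT)    # the very next page
assert modulo_index(page_a) != modulo_index(page_b)   # always distinct

# With hashing, adjacency guarantees nothing: these two neighbours collide.
page_c, page_d = 2047 << PAGE_SHIFT, 2048 << PAGE_SHIFT
assert hashed_index(page_c) == hashed_index(page_d)
```

So a simultaneous D$/I$ miss with both accesses crossing a page boundary can need four distinct translations, which modulo indexing spreads over at least two sets but a hash may pile onto one.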


    As noted, hashing is a risk here:
    Makes average better at expense of worst case;
    Effectively makes 4-way the bare minimum;
    Paradoxically, while making the average case better, it increases the associativity required to avoid performance loss in adverse cases (hash collisions).

    Some of the same benefits of hashing can also be realized in the OS via strategic use of ASLR (where ASLR tends to benefit TLB performance in
    this case).


    Can note that 8-way is an improvement over 4-way, but the gains are
    comparably smaller. Likewise, 16-way seems to gain very little over 8-way.

    Once the required associativity is present, more benefit is seen mostly
    by making the TLB bigger (if possible, you really want the working set
    to be able to fit into the TLB).


    But, as noted, the "Last Level Page-Table TLB" scheme I had considered previously could greatly expand the working set of the MMU without the
    cost of a full page walker.

    But, more a case of investing resources into fixing an issue that is not
    (yet) a significant performance issue.

    Granted, can note that both my BT3 engine and Quake 3 Arena tend to
    exceed the working-set size of the MMU, so are prone to a higher
frequency of TLB miss interrupts. But, fixing this still won't make
    Quake3 all that playable on a 50MHz CPU.

    ...



    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Tue Feb 3 01:03:36 2026
    From Newsgroup: comp.arch

After getting bogged down testing the hash table, I decided to shelve it
    for now. And...

    I re-wrote Qupls4 TLB for hierarchical page tables. It is much cleaner
    code now. The TLB is taking up only about 600 LUTs and 8 BRAMs now. It
is also simpler, being only a single-level TLB. Three copies of the TLB
    would be needed.

    Thinking about adding caching of translation steps to avoid main memory access.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Tue Feb 3 01:23:05 2026
    From Newsgroup: comp.arch

    On 2026-02-01 12:49 p.m., EricP wrote:
    Robert Finch wrote:
    On 2026-01-31 12:31 p.m., EricP wrote:
    Robert Finch wrote:

    The hash table is only slightly different than a giant TLB. The TLB
    is walked on a miss instead of walking main memory. Page faults
    would be handled the same way.

    Yes.
Long ago I read a paper about a port of Linux to a PowerPC with its IVT.
Linux memory management is based on a hierarchical page table (HPT) so
rather than rewrite the memory manager they treated the IVT as a
    large TLB.
    As pages were touched they copied the PTE's from the HPT into IVT.
    When there was a process switch they purged the IVT of all entries
    the same way one might purge a TLB. Later they added ASID's so they
    didn't have to purge IVT on address space switch.

    The code that copied the PTE's from HPT to IVT was very similar
    to a SW managed TLB, as VM Virtual Machines also do if they don't
    have nested page tables.

    I am not sure about getting Linux working. I may just use my own OS.
    Skimming through the mmap() code has given me an idea of just how
    complex the software is.

    It was just an example of how one can use a page table to emulate
    any other kind of page table.

Also you can't guarantee that your whole program will fit into a single inverted/hash table of a predetermined size because you don't know in
    advance which entries will collide and push other entries out.
    Even if it does fit today, any change to the program could break that.

    Collisions will not push other entries out, they just bounce to a
different empty entry. If one cannot be found then it is an out-of-memory
condition.

    Testing some. Testing just allocates pages without freeing them so it
    is a bit unrealistic.
    The IVT stopped after 63 bounces at 69% full. (A max of 64 bounces is
    set). Average number of bounces: 1.8 per lookup. This was with random
    ASIDs and addresses.
    The table can be clocked at twice the CPU rate.
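A rough rerun of the experiment described above: insert random translations into a clustered table, bouncing to another group on collision, until some insert exceeds the bounce limit. The rehash here is plain linear group probing and the sizes are guesses, so the exact fill fraction will differ from the 69% reported.

```python
import random

NUM_GROUPS, GROUP_SIZE, MAX_BOUNCES = 1024, 8, 64
table = [0] * NUM_GROUPS        # occupancy count per group is enough here

def insert(key):
    g = key % NUM_GROUPS
    for bounce in range(MAX_BOUNCES):
        if table[g] < GROUP_SIZE:
            table[g] += 1
            return bounce
        g = (g + 1) % NUM_GROUPS     # assumed rehash: linear group probing
    return None                       # insert failed -> out of memory

random.seed(1)
inserted = 0
while True:
    if insert(random.getrandbits(32)) is None:
        break
    inserted += 1

fill = inserted / (NUM_GROUPS * GROUP_SIZE)
print(f"table reached {fill:.0%} full before a {MAX_BOUNCES}-bounce failure")
```

As in the posted test, pages are only allocated and never freed, so this models the pessimistic fill-until-failure case.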

    One might make the hash function controllable so you can optimize the
    bounce length for different programs. Also the max bounce limit should
    be a dynamic HW register and not hard wired.

The max bounce limit is a HW register. Making the hash function
controllable is a bit tricky as it is used by HW. It might be easier to
provide a selection of a few different hashes. The hash function reminds me
of crypto rotates and combinations.


    I have been wondering how to collect and use stats to control things.

    The entries in the IVT for ASIDs no longer in use could be cleared
    automatically by a background HW process.

    That implies some kind of ASID allocator to track which are in-use.
    It also implies a whole IVT scan as each ASID is freed,
    or maintain a pending free list and a scavenger process.

For the hash table a small ASID (e.g. 8 bits) is used. It could be tracked
with a bitmap.

If the ASIDs are not reused too quickly the scavenger may not need to be
fast. An ASID FIFO could be used to allocate / free ASIDs.

Simpler to have process creation assign a new ASID by incrementing
    an unsigned up counter that starts at 1. When you hit its max then reset
    to 1 and clear the whole IVT for all non-zero ASID entries in one pass.
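The wrap-and-flush counter scheme can be sketched as follows (an 8-bit ASID space is assumed here, and the TLB/IVT is modeled as a dict; both are illustrative choices, not anyone's actual design):

```python
ASID_MAX = 255                     # assumed 8-bit ASID space

class AsidAllocator:
    def __init__(self, tlb):
        self.next_asid = 1         # ASID 0 reserved (e.g. for the kernel)
        self.tlb = tlb             # dict: (asid, vpn) -> pfn

    def allocate(self):
        if self.next_asid > ASID_MAX:
            # Wrap: flush everything tagged with a non-zero (process) ASID
            # in one pass, then start handing out ASIDs from 1 again.
            for key in [k for k in self.tlb if k[0] != 0]:
                del self.tlb[key]
            self.next_asid = 1
        asid = self.next_asid
        self.next_asid += 1
        return asid

tlb = {(0, 0x10): 0x1}             # a global/kernel entry under ASID 0
alloc = AsidAllocator(tlb)
for _ in range(ASID_MAX):          # consume the whole ASID space
    a = alloc.allocate()
    tlb[(a, 0x20)] = 0x2
alloc.allocate()                   # triggers the wrap-and-flush
assert (0, 0x10) in tlb            # the ASID-0 entry survives the flush
```

The appeal is that the expensive full-table pass happens only once per ASID-space wrap, which is rare if the ASID field is wide enough.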


    A garbage collection approach may work for removed entries. It might be
    a complicated state machine though to do in HW. It could look for
    invalid or entries with defunct ASIDs.



    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Feb 4 20:42:25 2026
    From Newsgroup: comp.arch


    EricP <ThatWouldBeTelling@thevillage.com> posted:

    EricP wrote:
    Robert Finch wrote:
    ??? I think a hash table has this characteristic. Multiple VA pages
can point to a single physical page using a hash table. Is it just a
    performance issue? ASID is part of the hash/compare.

    Be careful about ASID's - they aren't just part of the virtual address
    in MMU's like x86. There is a logic aspect in the address compare.
    If the PTE Global flag is set then ASID is NOT included in the VA compare, if the PTE Global flag is clear then ASID is included in the VA compare.

    If the G flag is set on OS kernel pages then the effect is to make the kernel virtual pages one single address space on all cores with no associated ASID, while process virtual spaces reside in their own ASID's.

    The above is easy to implement in a CAM because the Global flag is
    part of the PTE entry. Compare the virtual page number (VPN) and if
    there is a PTE match, then if the PTE.G flag is set take ASID as matched, otherwise check if ASID is matched.

    But if you just tack the ASID onto the VA to get a larger VA and
    use that as an index then you don't get the same logic because
    the PTE.G flag that tells you whether ASID matters for that PTE,
    is on the PTE that you are looking up.

    One could achieve the same effect as the CAM approach by picking off the range of addresses assigned to the OS and assigning them to ASID = 0,
and having the process range use the current non-zero ASID register.

    If we have a 64-bit VA with 4kB pages = 52-bit virtual page number
    and we added a 12-bit ASID then that gives a 76-bit virtual space
    with a 64-bit AS-VPN to look up. We reserve ASID=0 for the OS,
    1..4095 are for processes.

SPARC V8 had 13-bit ASIDs and found that this was way too short for
    the market SUN was targeting. SPARC V9 went with 16-bit ASIDs and
    I think this is as SHORT as an ASID can be for "reasonable" future
    proofing.

The MMU has a register MMU_OS_START that holds the OS unsigned lower
start address and a comparator. If the VA is >= MMU_OS_START then
    it uses an ASID = 0, otherwise it comes from the MMU_ASID register.

    This 12-bit value is concatenated with the 52-bit VPN to give a
    64-bit combined ASID-VPN index value to lookup in the TLB.

The OS software that assigns ASIDs also skips assigning 0 to any process.
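A sketch of that register-based alternative to a per-PTE Global bit: VAs at or above MMU_OS_START translate under ASID 0 (one shared kernel space), all lower VAs under the current process ASID. The constants here are illustrative only:

```python
MMU_OS_START = 0xFFFF_8000_0000_0000   # assumed kernel-space base address
PAGE_SHIFT = 12                         # 4 kB pages -> 52-bit VPN

def tlb_index(va, mmu_asid):
    # Pick ASID 0 for OS-range addresses, else the MMU_ASID register value.
    asid = 0 if va >= MMU_OS_START else mmu_asid
    vpn = va >> PAGE_SHIFT              # 52-bit virtual page number
    return (asid << 52) | vpn           # 64-bit combined ASID-VPN key

# Two processes produce the same key for a kernel address...
assert tlb_index(0xFFFF_8000_0000_1000, 7) == tlb_index(0xFFFF_8000_0000_1000, 9)
# ...but different keys for the same user address.
assert tlb_index(0x0000_0000_0040_0000, 7) != tlb_index(0x0000_0000_0040_0000, 9)
```

This keeps the indexed-lookup structure (no per-entry Global flag needed in the compare) while still sharing kernel translations across all address spaces.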



    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Feb 4 20:58:27 2026
    From Newsgroup: comp.arch


    kegs@provalid.com (Kent Dickey) posted:

    In article <10lji2u$2haq2$1@dont-email.me>,
    Robert Finch <robfi680@gmail.com> wrote:
    On 2026-01-30 4:44 p.m., BGB wrote:
    On 1/30/2026 3:03 PM, Robert Finch wrote:

    [ Proposal is a hashed page table, with 2x the entries to cover the entire physical address space, with each hash entry having 8 pages which need to
    be searched before moving to the next entry].

    It isn't obvious (at least to me) how what you are describing is much
    different from a SW managed TLB.

    It is not a lot different. On a translation miss the table is walked
    with HW though instead of SW walking main memory. There are enough
    entries in the table to cover the physical address range. Only page
faults are handled in SW which should be rare. A page fault is generated after 30 steps in the walk though. That is a miss on 240 possible locations. 240-way associativity is not enough?
The page fault handler will likely just report 'out-of-memory' then quit
    the app.

    I'm going to guess your physical address space is just 32-bits, otherwise this table is huge. I think you said you have 64KB pages (but then you said 256KB pages--which is just too big, so I'll ignore that), which for a
    44-bit physical address space would mean 2^28 entries, with each entry
    being 128-bits, for a 4GB page table. But for 32-bit, it's just 2^16
    entries of 64-bits each.

    A page table which doesn't scale is problematic. Especially if it needs
    OS support. Who wants to invest effort in a dead-end idea?

    Your hashed page table has issues which I think you should address:

    1) Supporting multiple page sizes is a pain. If you have 2MB pages and
    64KB pages, there are 32 64KB pages for each 2MB page. So do you
    lazily load the hashtable only on misses, or load all 32 aliases
    at once?

    I, for one, want to hear more about how he integrates multiple page
    sizes into an IPT or the associated TLB.

    2) It's common to just do a single hash lookup, then trap to software. Doing
    probing is a big pain. The problems come from removing entries.
    If you page out a page, it leaves gaps in the table. When you do
    a hardware lookup, and find no valid pages at the first entry, this
    doesn't tell you anything! The obvious solution is ALL hardware
    lookups must do the maximum searching before declaring a hash-table
    miss--which you say is 30 steps, checking 240 entries. I'm not sure
    there's a better solution than that, but this is a pretty big
    penalty for a TLB miss.

    3) It's very complex to efficiently update the hash table. On inserting
    a new entry, where do you put it? On one hand, a "move-to-front"
    strategy seems like a good idea--newly added pages seem like they
    should get lookup priority. But how do you do this? It looks like
    each software handled miss has a TON of work to do. It seems like
    you need to re-hash each entry you consider moving, to make sure it
    stays on its proper hash chain. And keeping the OS page structures
    in sync looks like even more work.

    4) The global nature of the hash table means multiple threads probably need
    a lock to change it. So misses to software have to get a lock before
    making any hashtable changes. This is very expensive. This is
    especially bad when a process ends--all of its translations need to
    be removed, and so all page table misses are blocked while that
    happens. This is bad.

On the other hand, when an OS is provided with means to shrink the hierarchical page tables, it is rarely used.

    For example, cat can be compiled and linked on My 66000 into an object
    module of 10-12 pages--including the paging tables themselves (1 page).

    A simpler approach is a per-process hash table of variable size in memory, sized to at least 2X the number of pages being used.

    The other thing I want to hear is how HyperVisor or Host OS uses its IVPT
    to remap Guest OS IVP into the single physical address space. That is how
    are IVPTs virtualized ??

    It's indexed by a hash of ASID+VA, and does a single probe. It might be useful to do two hash tables: one for large pages, and one for small pages (with 2 lookups).
    On any miss, the hash entry to be updated is easy to find for software.
    But: there's effectively no associativity--but just increase the hash table's size if software detects that you are getting conflicts. Pick a good
    hash function to avoid stupid conflicts. I recommend you make the hash function an instruction, so software can calculate it easily (so if you reverse bits and do other stuff that may be hard for software to calculate, you make it easy).

    This gives you the benefits of a hardware table walker with less trouble.
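The per-process single-probe scheme reads as something like the sketch below (my interpretation; the helper names, sizes, and the example hash are all assumptions, and a real design would use the proposed hash instruction rather than Python):

```python
PAGE_SHIFT = 16

def probe(table, asid, va):
    """Single hardware probe; returns PFN or None (software handles misses)."""
    vpn = va >> PAGE_SHIFT
    slot = table[(asid ^ vpn) % len(table)]
    if slot is not None and slot[:2] == (asid, vpn):
        return slot[2]
    return None

def sw_refill(table, asid, va, pfn):
    """Software miss handler: exactly one easy-to-find slot to update
    (a conflicting occupant is simply evicted; if conflicts become
    frequent, software grows and rehashes the table)."""
    vpn = va >> PAGE_SHIFT
    table[(asid ^ vpn) % len(table)] = (asid, vpn, pfn)

table = [None] * 64               # 2x headroom for a 32-page working set
sw_refill(table, 3, 0x40000, 0x9)
assert probe(table, 3, 0x40000) == 0x9
assert probe(table, 3, 0x50000) is None   # miss -> trap to sw_refill
```

The key trade: no probing chains, no remove-time gap problem, and the miss handler always knows which slot to write, at the cost of resolving conflicts entirely by table sizing.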

    An advantage of hashed page tables is the TLB itself has no "intermediate" entries. This lets the primary TLB hold more useful translations. This
    is not much of an issue if the primary TLB is >= 256 entries, but it's
    really noticeable if the TLB is <= 32 entries. You need to lean in to hashed page tables advantages.

    Question: If one has a 32-E to 64E Fully Associative L1 TLB backed up by
    a 6-cycle 1024-E 4-way L2 TLB--does the L1+L2 constitute its "primary"
    TLB (or not with explanation) ?!?

    Kent
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Feb 4 21:08:28 2026
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2026-02-01 10:28 a.m., Kent Dickey wrote:
    In article <10lji2u$2haq2$1@dont-email.me>,
    Robert Finch <robfi680@gmail.com> wrote:
    On 2026-01-30 4:44 p.m., BGB wrote:
    On 1/30/2026 3:03 PM, Robert Finch wrote:

    [ Proposal is a hashed page table, with 2x the entries to cover the entire physical address space, with each hash entry having 8 pages which need to be searched before moving to the next entry].

    It isn't obvious (at least to me) how what you are describing is much
    different from a SW managed TLB.

    It is not a lot different. On a translation miss the table is walked
    with HW though instead of SW walking main memory. There are enough
    entries in the table to cover the physical address range. Only page
faults are handled in SW which should be rare. A page fault is generated
after 30 steps in the walk though. That is a miss on 240 possible
locations. 240-way associativity is not enough?
The page fault handler will likely just report 'out-of-memory' then quit
    the app.

    I'm going to guess your physical address space is just 32-bits, otherwise

    Yes, this is for a 29-bit DRAM address space. I was going to use 2^13 entries with 256kB (or possibly smaller) pages. It is only usable with a smaller address space that can fit into BRAM.

    this table is huge. I think you said you have 64KB pages (but then you said
    256KB pages--which is just too big, so I'll ignore that), which for a 44-bit physical address space would mean 2^28 entries, with each entry being 128-bits, for a 4GB page table. But for 32-bit, it's just 2^16 entries of 64-bits each.

    A page table which doesn't scale is problematic. Especially if it needs
    OS support. Who wants to invest effort in a dead-end idea?

    Your hashed page table has issues which I think you should address:

    1) Supporting multiple page sizes is a pain. If you have 2MB pages and
    64KB pages, there are 32 64KB pages for each 2MB page. So do you
    lazily load the hashtable only on misses, or load all 32 aliases
    at once?

Supporting multiple page sizes is a pain even for a hierarchical table. Multiple TLBs are used, so maybe multiple hash tables could be used. But
    I was not planning on supporting more than one page size.

    I support multiple page sizes, and at both the architectural level and
    in the TLB circuits I did not find it all that difficult. {Hint: I built
one of the TLBs for Ross HyperSparc chips.} There are several ways to do
    it in logic, none of it "hard". Most of it "a bit different".

    Defining Super pages is "just stopping" early in the tablewalk.
    Skipping intermediate levels is just like "stopping early" except
    for the stopping part.
    Skipping top layers is "child's play" at that point {Smaller VAS}.

    My 66000 has all of that; but in addition:
    My 66000 SuperPages use the unused PA bits as a limit function so a
    SuperPage of 1024 pages can be used to map 1-1024 actual page(s).


    2) It's common to just do a single hash lookup, then trap to software. Doing
    probing is a big pain. The problems come from removing entries.
    If you page out a page, it leaves gaps in the table. When you do
    a hardware lookup, and find no valid pages at the first entry, this
    doesn't tell you anything! The obvious solution is ALL hardware

    Thanks, I had forgotten the issue with removes. Inserts are easy without removes. I wonder if the removes could be done in bulk similar to a
    garbage collection. Just mark the entry as deleted rather than leaving a gap, then have a process that periodically rebuilds the table. For the
table size I am planning on, garbage collecting it may not be too bad.

    The table probes until it finds the entry, or an empty entry in a group where the lookup should be. It stops after a certain number of probes if there are no empty entries in the group.

    The lookup returns the group and the group empty status at which there
    is an empty entry if the lookup is not found. So, adding a new entry
    should be straightforward.

    lookups must do the maximum searching before declaring a hash-table
    miss--which you say is 30 steps, checking 240 entries. I'm not sure
    there's a better solution than that, but this is a pretty big
    penalty for a TLB miss.

30 steps (in HW) is about as fast as a divide operation. It only works
for BRAM; if it were DRAM it would be many clocks per lookup.

    Also: a far cry from a 6-cycle L2 TLB with 1024-E.


    3) It's very complex to efficiently update the hash table. On inserting
    a new entry, where do you put it? On one hand, a "move-to-front"
    strategy seems like a good idea--newly added pages seem like they
    should get lookup priority. But how do you do this? It looks like
    each software handled miss has a TON of work to do. It seems like

    I was not planning on anything that complex. The miss returns the group
    at which the entry should be located, so SW should know where to insert.
For testing, the test bench fakes inserting pages. The issues arise when
    the table gets full.

    you need to re-hash each entry you consider moving, to make sure it
    stays on its proper hash chain. And keeping the OS page structures
    in sync looks like even more work.

    4) The global nature of the hash table means multiple threads probably need
    a lock to change it. So misses to software have to get a lock before
    making any hashtable changes. This is very expensive. This is
    especially bad when a process ends--all of its translations need to
    be removed, and so all page table misses are blocked while that
    happens. This is bad.

    Definitely an issue. May have only the system (one) thread able to
    update it.


    A simpler approach is a per-process hash table of variable size in memory, sized to at least 2X the number of pages being used. It's indexed by a hash
    of ASID+VA, and does a single probe. It might be useful to do two hash tables: one for large pages, and one for small pages (with 2 lookups).
    On any miss, the hash entry to be updated is easy to find for software. But: there's effectively no associativity--but just increase the hash table's
    size if software detects that you are getting conflicts. Pick a good
    hash function to avoid stupid conflicts. I recommend you make the hash function an instruction, so software can calculate it easily (so if you reverse bits and do other stuff that may be hard for software to calculate, you make it easy).

    This gives you the benefits of a hardware table walker with less trouble.

    An advantage of hashed page tables is the TLB itself has no "intermediate" entries. This lets the primary TLB hold more useful translations. This
    is not much of an issue if the primary TLB is >= 256 entries, but it's really noticeable if the TLB is <= 32 entries. You need to lean in to hashed
    page tables advantages.

    Kent

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Feb 5 15:07:55 2026
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> writes:
    On 2026-02-01 12:49 p.m., EricP wrote:
    Simpler to have create process assign a new ASID by incrementing
    an unsigned up counter that starts at 1. When you hit its max then reset
    to 1 and clear the whole IVT for all non-zero ASID entries in one pass.


    A garbage collection approach may work for removed entries. It might be
    a complicated state machine though to do in HW. It could look for
    invalid or entries with defunct ASIDs.

    AFAIK the approach outlined by EricP has been used in commercial
    hardware. It's simple, and if there are enough ASIDs, cheap (because
    resets are rare). So better add more bits to the ASIDs than introduce complicated machinery for reclaiming ASIDs.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From kegs@kegs@provalid.com (Kent Dickey) to comp.arch on Fri Feb 6 05:23:38 2026
    From Newsgroup: comp.arch

    In article <1770238707-5857@newsgrouper.org>,
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    kegs@provalid.com (Kent Dickey) posted:

An advantage of hashed page tables is the TLB itself has no "intermediate" entries. This lets the primary TLB hold more useful translations. This
    is not much of an issue if the primary TLB is >= 256 entries, but it's
    really noticeable if the TLB is <= 32 entries. You need to lean in to hashed
    page tables advantages.

    Question: If one has a 32-E to 64E Fully Associative L1 TLB backed up by
    a 6-cycle 1024-E 4-way L2 TLB--does the L1+L2 constitute its "primary"
    TLB (or not with explanation) ?!?

    I think the 32-entry fully associative is the primary TLB. The 1024 entry
    is a secondary TLB. Hits in the primary TLB cause no stalls at all.
    Misses in the primary probably lead to stalls (but it is possible the
    6 cycle latency can be hidden sometimes), so it costs performance.

    If the primary TLB (the one with no stalls) is only 4 entries, then
    we can easily calculate a pretty substantial stall penalty for most
    programs since it will pay the 6 cycle stall a lot (and that's for a L2
    TLB hit). So we make the L1 TLB as big as possible, but it cannot be made
    too large or its too slow, so it ends up between 32- and 128-entries
    (256 tops).

    I'm just pointing out the intermediate hierarchical entries (the page
table entries which are just page table pointers) need to be cached in
    the TLB to make future lookups faster. And putting them there reduces
    the size of the TLB for "useful" entries. You can try to avoid this
    penalty by only putting page table pointer entries in the L2 TLB, but then you're just making primary TLB misses slower, especially since the L2 TLB lookup cannot scan all entries like the primary can--you have to probe
    each one separately.

    If you want to translate the address 0x1234_56789000 using 4KB pages, you
    can do a SINGLE primary associative TLB lookup, and look for:

    1) Page entry for 0x1234_56789000
    2) Page-table pointer for 0x1234_566xxxxx
    3) Page-table pointer for 0x1234_4xxxxxxx
    4) Page-table pointer for 0x120x_xxxxxxxx

(you may need to do a second lookup to get page-table entries, since it's expensive to deal with multiple hits in one clock).

    With the page-table-pointer for 0x1234_566xxxxx, you then do a single
    data cache lookup to get the page entry (say 3 more clocks). So putting
    these page-table pointers in the L1 TLB has a big benefit to the page
    walk times.
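That single associative lookup can be modeled as a best-match search over cached entries at every level (a toy model only; the level shifts assume a 4-level, 4 kB, x86-like layout, which is my assumption, not Kent's):

```python
LEVEL_SHIFTS = [12, 21, 30, 39]   # leaf page, then PT, PD, PDP pointers

def best_match(tlb, va):
    """Return (level, entry) for the deepest cached entry covering va,
    preferring the leaf translation, then the closest page-table pointer."""
    for level, shift in enumerate(LEVEL_SHIFTS):
        key = (level, va >> shift)
        if key in tlb:
            return level, tlb[key]
    return None                    # nothing cached: full walk from the root

tlb = {}
tlb[(1, 0x1234_5678_9000 >> 21)] = "pt-pointer"    # cached level-1 pointer
level, entry = best_match(tlb, 0x1234_5678_9000)
assert level == 1 and entry == "pt-pointer"        # one memory access left
tlb[(0, 0x1234_5678_9000 >> 12)] = "leaf-pte"
assert best_match(tlb, 0x1234_5678_9000) == (0, "leaf-pte")
```

A fully associative L1 TLB can evaluate all four candidate matches in one cycle; a set-associative L2 TLB has to probe each level separately, which is the 9-clock back-to-back sequence described above.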

    But note that for the L2 TLB, looking each of those up may be a separate lookup. And you'd want to do a lookup in the reverse order from the hierarchical table walk--you want the page if it exists, then the closest page-table, etc. At 6-cycles latency, it seems like you would do the
    lookups back-to-back and not wait for the results, so you would look up
    all 4 in the L2 TLB in 9 clocks. And putting page-table-pointers in the
    L2 TLB is tricky, the indexing function needs thought (it's similar to the problem for handling variable page sizes with a hash table, but we KNOW
    which size we're looking for at each step at least).

    And I'm just pointing out putting these page table pointer entries in
    the L1 TLB makes the L1 TLB feel "smaller". For even the smallest
    process has three separate translation areas: code, data, and stack, and
    these are all on different pages, and they will all be touched
    regularly. So there's intermediate page table pointers crowding other
    useful translations out of the L1 TLB--this tiny process of 3 pages
    total could use 4+2+2 L1 TLB entries (I'm assuming the regions are at
    least 1GB apart, and using 4KB pages, which are reasonable assumptions).

    Accounting for the overhead, it seems like almost any process will waste
    at least 7 L1 TLB entries to these intermediate page table pointer
    entries, so your 32-entry L1 TLB will generally hold no more than 25
    useful pages.
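Kent's accounting can be checked mechanically: three regions (code, data, stack) at least 1 GB apart with 4 kB pages and a 4-level table. The placements below are hypothetical, chosen only to satisfy the spacing assumption:

```python
LEVEL_SHIFTS = [12, 21, 30, 39]   # leaf page, then 3 pointer levels
regions = [0x0000_0040_0000,      # code   (hypothetical placements)
           0x0000_8000_0000,      # data, ~2 GB away
           0x7FFF_F000_0000]      # stack, far above both

# Distinct (level, index) entries the L1 TLB would end up holding.
entries = {(lvl, addr >> s) for addr in regions
                            for lvl, s in enumerate(LEVEL_SHIFTS)}
leaf = sum(1 for lvl, _ in entries if lvl == 0)
pointers = len(entries) - leaf
print(leaf, pointers)             # useful leaf pages vs. pointer overhead
assert leaf == 3 and pointers >= 7   # matches the "at least 7 wasted" claim
```

With these placements the 3 useful pages drag 8 intermediate-pointer entries into the TLB, in line with the estimate that a 32-entry L1 TLB holds no more than about 25 useful translations.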

    Kent
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Fri Feb 6 09:29:08 2026
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    Robert Finch <robfi680@gmail.com> writes:
    On 2026-02-01 12:49 p.m., EricP wrote:
    Simpler to have create process assign a new ASID by incrementing
an unsigned up counter that starts at 1. When you hit its max then reset to 1 and clear the whole IVT for all non-zero ASID entries in one pass.


    A garbage collection approach may work for removed entries. It might be
    a complicated state machine though to do in HW. It could look for
    invalid or entries with defunct ASIDs.

    AFAIK the approach outlined by EricP has been used in commercial
    hardware. It's simple, and if there are enough ASIDs, cheap (because
    resets are rare). So better add more bits to the ASIDs than introduce complicated machinery for reclaiming ASIDs.

    - anton

    The method I described before was a bit brute force, as it looks like
    it has to stall the whole system while it erases the old ASIDs.
    It is inefficient as it erases all ASIDs including ones recently assigned.

    One issue with invalidating ASIDs is that the L1-TLB may be fully
    associative, the L2-TLB often are set associative.
While a full assoc TLB can invalidate multiple matching entries in
    1 clock, the set assoc TLB requires a scan doing a read-compare-write
    operation on all the ways in each set, taking 2 or more clocks per set.
    So an L2-TLB with 512 sets could take 1024 clocks or more to complete
    an ASID shoot down.

    This would also cause all running processes to have a flurry of TLB misses
    as they reload the entries just purged.

I would want something smoother, continuous, that usually didn't need to
    stall the whole system, and only recycles the oldest ASIDs.

    Something like, split the ASIDs into 4 groups using the high 2 msb.
    As the ASID counter increments from group N to group N+1, it initiates an invalidate of group N+2, such that it retains the recent ASIDs in group N
    and N-1. And this is circular so when it starts assigning in G3 it
    initiates the invalidate of G0. The invalidate is also asynchronous,
    so the system can continue to run and assign ASIDs while it invalidates
    the next group ahead (there are some interlocks waiting for ACK's).

    e.g. Say the ASID counter is 7 bits so 4 groups of 32 IDs, called G0..G3.
The current assignment ASID counter = 63 in G1.
When it increments to 64 in G2, it begins invalidating all group G3 ids
in the range 96..127. Any processes currently using ASIDs in the G3 range
    have their ASID invalidated, and when they next run will assign a new id
in the current assignment group. All processes in groups G0 and G1 are unaffected by this ASID recycling. The core broadcasts an IPI (inter-processor interrupt) to all other cores to shoot down all group G3 ASIDs.
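A rough simulation of that circular group scheme: a 7-bit ASID counter split into 4 groups of 32 by its top two bits; crossing from group N into N+1 invalidates group N+2, preserving the current and previous groups. (The synchronous invalidate here stands in for the asynchronous hardware process described above.)

```python
GROUP_SIZE = 32                    # 4 groups x 32 = 7-bit ASID space

live = set()                       # ASIDs currently valid in TLBs
counter = 0

def allocate():
    global counter
    old_group = counter // GROUP_SIZE
    counter = (counter + 1) % (4 * GROUP_SIZE)
    new_group = counter // GROUP_SIZE
    if new_group != old_group:
        # Entering group N+1 invalidates group N+2 (= new_group + 1 here),
        # keeping the ASIDs in groups N and N-1 intact.
        victim = (new_group + 1) % 4
        live.difference_update(range(victim * GROUP_SIZE,
                                     (victim + 1) * GROUP_SIZE))
    live.add(counter)
    return counter

for _ in range(100):               # wander through several group crossings
    allocate()
# By the time allocation reaches group G3 (IDs 96..127), group G0 (0..31)
# has been shot down, while the two most recent groups are untouched.
assert not any(a < 32 for a in live)
```

Unlike the wrap-and-flush counter, only the oldest quarter of the ASID space is ever recycled at once, so recently assigned processes keep their TLB entries.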

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Fri Feb 6 11:59:27 2026
    From Newsgroup: comp.arch

    MitchAlsup wrote:
    EricP <ThatWouldBeTelling@thevillage.com> posted:

    EricP wrote:
    Robert Finch wrote:
    ??? I think a hash table has this characteristic. Multiple VA pages
can point to a single physical page using a hash table. Is it just a performance issue? ASID is part of the hash/compare.
    Be careful about ASID's - they aren't just part of the virtual address
    in MMU's like x86. There is a logic aspect in the address compare.
    If the PTE Global flag is set then ASID is NOT included in the VA compare, >>> if the PTE Global flag is clear then ASID is included in the VA compare. >>>
    If the G flag is set on OS kernel pages then the effect is to make the
    kernel virtual pages one single address space on all cores with no
    associated ASID, while process virtual spaces reside in their own ASID's. >>>
    The above is easy to implement in a CAM because the Global flag is
    part of the PTE entry. Compare the virtual page number (VPN) and if
    there is a PTE match, then if the PTE.G flag is set take ASID as matched, >>> otherwise check if ASID is matched.

    But if you just tack the ASID onto the VA to get a larger VA and
    use that as an index then you don't get the same logic because
    the PTE.G flag that tells you whether ASID matters for that PTE,
    is on the PTE that you are looking up.
    One could achieve the same effect as the CAM approach by picking off the
    range of addresses assigned to the OS and assigning them to ASID = 0,
    and having the process range use the current non-zero ASID register.

    If we have a 64-bit VA with 4kB pages = 52-bit virtual page number
    and we added a 12-bit ASID then that gives a 76-bit virtual space
    with a 64-bit AS-VPN to look up. We reserve ASID=0 for the OS,
    1..4095 are for processes.
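    A sketch of that lookup key (assuming the 52-bit VPN and 12-bit ASID
    above; names are illustrative):

```python
# Build the 64-bit AS-VPN lookup key from a 64-bit VA and a 12-bit ASID,
# assuming 4 kB pages (12 offset bits) so the VPN is 52 bits.
PAGE_SHIFT = 12
VPN_BITS = 52
OS_ASID = 0            # ASID 0 reserved for the OS; 1..4095 for processes

def as_vpn(va, asid):
    assert 0 <= asid < (1 << 12)
    vpn = (va >> PAGE_SHIFT) & ((1 << VPN_BITS) - 1)
    return (asid << VPN_BITS) | vpn
```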

    SPARC V8 had 13-bit ASIDs and found that this was way too short for
    the market SUN was targeting. SPARC V9 went with 16-bit ASIDs and
    I think this is as SHORT as an ASID can be for "reasonable" future
    proofing.

    It also depends on whether ASIDs are assigned globally for a whole system
    or locally for each core. Local assignment for each core needs less bits.
    For globally assigned ASID 16 bits sounds too small for today -
    they recycle too often.

    When ASIDs were first introduced on x86 I remember a certain amount
    of concern for the effects they would have on the fully assoc TLB.
    The 32-bit x86 has a 20-bit Virtual Page Number so adding a 12-bit ASID
    would increase the TLB size and power usage by 60%.

    Back then I spent some time looking at how to do local 4-bit ASIDs per-core. That is, each core has its own 4-bit ASID counter and each process
    remembers which ASID it used last on each core.
    Each core's TLB retains entries for up to the last 16 processes to run
    threads on that core, and each core recycles its ASID independently.
    The 4-bit size was chosen based on the assumption that the odds are small
    of a TLB containing useful entries beyond the last 16 processes to run.

    This eliminates many of the global ASID shoot-downs, but 4 bits is too small as
    it would recycle too quickly causing too frequent local ASID invalidates.

    Combining all the ideas: a 16+ bit ASID, locally assigned per core,
    with a batch TLB ASID invalidate mechanism using a bit mask for matching
    ASIDs, so I can invalidate all ids in just one of the groups I mentioned elsewhere: GGxx_xxxx_xxxx_xxxx, where GG is the 2-bit group number.
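    A sketch of that masked match (assuming a 16-bit ASID whose top 2 bits
    are the group, as above; names are illustrative):

```python
# Batch-invalidate match: an entry's ASID matches the invalidate request
# if it agrees with the request value under the request mask. The mask
# 0xC000 selects only the 2-bit group field GG of a 16-bit ASID.
GROUP_MASK = 0xC000

def group_match(entry_asid, group):
    return (entry_asid & GROUP_MASK) == (group << 14)

def invalidate_group(tlb_asids, group):
    # Returns the entries that survive a group invalidate.
    return [a for a in tlb_asids if not group_match(a, group)]
```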


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Feb 6 18:01:15 2026
    From Newsgroup: comp.arch


    kegs@provalid.com (Kent Dickey) posted:

    In article <1770238707-5857@newsgrouper.org>,
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    kegs@provalid.com (Kent Dickey) posted:

    An advantage of hashed page tables is the TLB itself has no "intermediate"
    entries. This lets the primary TLB hold more useful translations. This
    is not much of an issue if the primary TLB is >= 256 entries, but it's
    really noticeable if the TLB is <= 32 entries. You need to lean in to
    hashed page tables' advantages.

    Question: If one has a 32-E to 64E Fully Associative L1 TLB backed up by
    a 6-cycle 1024-E 4-way L2 TLB--does the L1+L2 constitute its "primary"
    TLB (or not with explanation) ?!?

    I think the 32-entry fully associative is the primary TLB. The 1024 entry
    is a secondary TLB. Hits in the primary TLB cause no stalls at all.
    Misses in the primary probably lead to stalls (but it is possible the
    6 cycle latency can be hidden sometimes), so it costs performance.

    If the primary TLB (the one with no stalls) is only 4 entries, then
    we can easily calculate a pretty substantial stall penalty for most
    programs since it will pay the 6 cycle stall a lot (and that's for a L2
    TLB hit). So we make the L1 TLB as big as possible, but it cannot be made too large or it's too slow, so it ends up between 32 and 128 entries
    (256 tops).

    I'm just pointing out the intermediate hierarchical entries (the page
    table entries which are just page table pointers) need to be cached in
    the TLB to make future lookups faster.

    Cached yes, in TLB no--generally cached in TableWalk Accelerator store.

    And putting them there reduces
    the size of the TLB for "useful" entries.

    Which is why you don't put 'em there.

    You can try to avoid this
    penalty by only putting page table pointer entries in the L2 TLB, but then you're just making primary TLB misses slower, especially since the L2 TLB lookup cannot scan all entries like the primary can--you have to probe
    each one separately.

    If you want to translate the address 0x1234_56789000 using 4KB pages, you
    can do a SINGLE primary associative TLB lookup, and look for:

    1) Page entry for 0x1234_56789000
    2) Page-table pointer for 0x1234_566xxxxx
    3) Page-table pointer for 0x1234_4xxxxxxx
    4) Page-table pointer for 0x120x_xxxxxxxx

    (you may need to do a second lookup to get page-table entries, since it's expensive to deal with multiple hits in one clock).
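    The four candidate tags above can be sketched as follows (assuming the
    x86-64-style 4-level layout with 4 KB pages and 9 index bits per level;
    names are illustrative):

```python
# For a VA, compute the tag that would match at each level of a single
# associative lookup: the leaf page (4 KB) plus the page-table pointers
# covering the enclosing 2 MB, 1 GB and 512 GB regions.
LEVEL_SHIFTS = [12, 21, 30, 39]   # 4 KB, 2 MB, 1 GB, 512 GB

def lookup_tags(va):
    return [va >> s for s in LEVEL_SHIFTS]

# For va = 0x1234_56789000 the covered regions are
# 0x1234_56789xxx, 0x1234_566xxxxx, 0x1234_4xxxxxxx, 0x120x_xxxxxxxx.
```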

    Another reason to put them other than in TLB. One can perform all 4 accesses simultaneously, and start the tablewalk as far down the tree as possible.

    With the page-table-pointer for 0x1234_566xxxxx, you then do a single
    data cache lookup to get the page entry (say 3 more clocks). So putting these page-table pointers in the L1 TLB has a big benefit to the page
    walk times.

    A delicate balance, especially if L2 is less than 10 cycles.
    ------------------------
    Accounting for the overhead, it seems like almost any process will waste
    at least 7 L1 TLB entries to these intermediate page table pointer
    entries, so your 32-entry L1 TLB will generally hold no more than 25
    useful pages.

    So, you don't put 'em in TLB; you put 'em in TW-Accelerator Store.

    Kent
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Fri Jan 30 11:51:15 2026
    From Newsgroup: comp.arch

    On 2026-01-30 6:16 a.m., Dan Cross wrote:
    In article <10lfuka$1d2c4$1@dont-email.me>,
    Robert Finch <robfi680@gmail.com> wrote:
    Have inverted page tables fallen out of favor? And how much OS support
    is there for them?

    IVTs are a terrible idea. Consider the challenge of using them
    to share memory between virtual address spaces in a
    multiprocessing system, efficiently.

    - Dan C.

    I may have used misleading terminology, the IVT I refer to is a hash
    table based one. I tend to think of them as the same thing. I do not
    think anybody would use a plain IVT.

    <response to earlier comments>
    Is the entire VAS covered by the hierarchical page table system?

    With the entire PAS covered by the page table in BRAM it can be walked
    in hardware very quickly, one cycle per step as opposed to walking the
    page table in DRAM which could be quite slow.

    Process switch is handled by including an ASID in the mapping as for a TLB.

    For the IVT implementation the table is twice the size needed to cover
    the PAS to allow for shared memory pages.


    The table just stores mappings VPN -> PPN, so I am not quite following
    the challenge to using them for shared memory. Multiple VPNs mapping to
    the same PPN are possible. Is it the freeing up of physical pages in SW
    that causes an issue?

    I seem to recall at least one fellow advocating the limited use of
    shared memory, using replication instead of shared libraries for instance.

    A hierarchical page table is a better solution, but I was looking for something lower cost. My monster 2-level TLB is currently about 9000
    LUTs (I have been working to reduce this). The IVT is about 900 LUTs.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Jan 30 20:35:14 2026
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2026-01-30 6:16 a.m., Dan Cross wrote:
    In article <10lfuka$1d2c4$1@dont-email.me>,
    Robert Finch <robfi680@gmail.com> wrote:
    Have inverted page tables fallen out of favor? And how much OS support
    is there for them?

    IVTs are a terrible idea. Consider the challenge of using them
    to share memory between virtual address spaces in a
    multiprocessing system, efficiently.

    - Dan C.

    I may have used misleading terminology, the IVT I refer to is a hash
    table based one. I tend to think of them as the same thing. I do not
    think anybody would use a plain IVT.

    <response to earlier comments>
    Is the entire VAS covered by the hierarchical page table system?

    With the entire PAS covered by the page table in BRAM it can be walked
    in hardware very quickly, one cycle per step as opposed to walking the
    page table in DRAM which could be quite slow.

    Process switch is handled by including an ASID in the mapping as for a TLB.

    For the IVT implementation the table is twice the size needed to cover
    the PAS to allow for shared memory pages.


    The table just stores mappings VPN -> PPN, so I am not quite following
    the challenge to using them for shared memory. Multiple VPNs mapping to
    the same PPN are possible. Is it the freeing up of physical pages in SW
    that causes an issue?

    Consider mmap(file) in multiple different processes at different VAs.
    So, now one has multiple VA pages pointing at one Physical page.

    I seem to recall at least one fellow advocating the limited use of
    shared memory, using replication instead of shared libraries for instance.

    A hierarchical page table is a better solution, but I was looking for something lower cost. My monster 2-level TLB is currently about 9000
    LUTs (I have been working to reduce this). The IVT is about 900 LUTs.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Feb 1 16:41:34 2026
    From Newsgroup: comp.arch

    On 2026-02-01 10:28 a.m., Kent Dickey wrote:
    In article <10lji2u$2haq2$1@dont-email.me>,
    Robert Finch <robfi680@gmail.com> wrote:
    On 2026-01-30 4:44 p.m., BGB wrote:
    On 1/30/2026 3:03 PM, Robert Finch wrote:

    [ Proposal is a hashed page table, with 2x the entries to cover the entire physical address space, with each hash entry having 8 pages which need to
    be searched before moving to the next entry].

    It isn't obvious (at least to me) how what you are describing is much
    different from a SW managed TLB.

    It is not a lot different. On a translation miss the table is walked
    with HW though instead of SW walking main memory. There are enough
    entries in the table to cover the physical address range. Only page
    faults are handled in SW which should be rare. A page fault is generated
    after 30 steps in the walk though. That is a miss on 240 possible
    locations. 240-way associativity is not enough?
    The page fault handler will likely just report 'out-of-memory' then quit
    the app.

    I'm going to guess your physical address space is just 32-bits, otherwise

    Yes, this is for a 29-bit DRAM address space. I was going to use 2^13
    entries with 256kB (or possibly smaller) pages. It is only usable with a smaller address space that can fit into BRAM.

    this table is huge. I think you said you have 64KB pages (but then you said 256KB pages--which is just too big, so I'll ignore that), which for a
    44-bit physical address space would mean 2^28 entries, with each entry
    being 128-bits, for a 4GB page table. But for 32-bit, it's just 2^16
    entries of 64-bits each.
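    The sizing arithmetic above can be checked directly (a sketch; entry
    sizes as stated in the paragraph):

```python
# Hashed page table sized to cover the whole physical address space:
# one entry per physical page frame.
def table_bytes(pa_bits, page_bits, entry_bytes):
    entries = 1 << (pa_bits - page_bits)
    return entries * entry_bytes

# 44-bit PA, 64 KB pages, 128-bit entries -> 2^28 entries, 4 GB of table.
big = table_bytes(44, 16, 16)
# 32-bit PA, 64 KB pages, 64-bit entries -> 2^16 entries, 512 KB of table.
small = table_bytes(32, 16, 8)
```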

    A page table which doesn't scale is problematic. Especially if it needs
    OS support. Who wants to invest effort in a dead-end idea?

    Your hashed page table has issues which I think you should address:

    1) Supporting multiple page sizes is a pain. If you have 2MB pages and
    64KB pages, there are 32 64KB pages for each 2MB page. So do you
    lazily load the hashtable only on misses, or load all 32 aliases
    at once?

    Supporting multiple page sizes is a pain even for a hierarchical table. Multiple TLBs are used, so maybe multiple hash tables could be used. But
    I was not planning on supporting more than one page size.


    2) It's common to just do a single hash lookup, then trap to software. Doing
    probing is a big pain. The problems come from removing entries.
    If you page out a page, it leaves gaps in the table. When you do
    a hardware lookup, and find no valid pages at the first entry, this
    doesn't tell you anything! The obvious solution is ALL hardware

    Thanks, I had forgotten the issue with removes. Inserts are easy without
    removes. I wonder if the removes could be done in bulk, similar to a
    garbage collection. Just mark the entry as deleted rather than leaving a
    gap, then have a process that periodically rebuilds the table. For the
    table size I am planning on using, garbage collection may not be too bad.

    The table probes until it finds the entry, or an empty entry in a group
    where the lookup should be. It stops after a certain number of probes if
    there are no empty entries in the group.

    If the lookup is not found, it returns the group and that group's
    empty-entry status (where a new entry could go). So, adding a new entry
    should be straightforward.
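    A sketch of that probe/tombstone scheme (all names illustrative; group
    size and probe limit chosen to match the numbers in this thread):

```python
# Hashed page table probe with tombstones: a lookup walks groups until it
# finds the entry, hits an empty slot (a definite miss), or exhausts the
# probe limit. Deleted entries are tombstoned so probes can continue past
# them; a periodic rebuild reclaims tombstones.
EMPTY, DELETED = object(), object()
GROUP_SIZE = 8
MAX_PROBES = 30

def lookup(table, key, hash_fn):
    g = hash_fn(key)
    n_groups = len(table) // GROUP_SIZE
    for _ in range(MAX_PROBES):
        base = (g % n_groups) * GROUP_SIZE
        for i in range(GROUP_SIZE):
            slot = table[base + i]
            if slot is EMPTY:
                return None, base       # miss; an insert could go here
            if slot is not DELETED and slot[0] == key:
                return slot[1], base    # hit: return the PPN
        g += 1                          # probe the next group
    return None, None                   # miss after maximum probing
```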

    lookups must do the maximum searching before declaring a hash-table
    miss--which you say is 30 steps, checking 240 entries. I'm not sure
    there's a better solution than that, but this is a pretty big
    penalty for a TLB miss.

    30 steps (in HW) is about as fast as a divide operation. It only works
    for BRAM; if it was DRAM it would be many clocks per lookup.


    3) It's very complex to efficiently update the hash table. On inserting
    a new entry, where do you put it? On one hand, a "move-to-front"
    strategy seems like a good idea--newly added pages seem like they
    should get lookup priority. But how do you do this? It looks like
    each software handled miss has a TON of work to do. It seems like

    I was not planning on anything that complex. The miss returns the group
    at which the entry should be located, so SW should know where to insert.
    For testing, the test bench fakes inserting pages. The issues arise when
    the table gets full.

    you need to re-hash each entry you consider moving, to make sure it
    stays on its proper hash chain. And keeping the OS page structures
    in sync looks like even more work.

    4) The global nature of the hash table means multiple threads probably need
    a lock to change it. So misses to software have to get a lock before
    making any hashtable changes. This is very expensive. This is
    especially bad when a process ends--all of its translations need to
    be removed, and so all page table misses are blocked while that
    happens. This is bad.

    Definitely an issue. Maybe have only the system (one) thread able to
    update it.


    A simpler approach is a per-process hash table of variable size in memory, sized to at least 2X the number of pages being used. It's indexed by a hash of ASID+VA, and does a single probe. It might be useful to do two hash tables: one for large pages, and one for small pages (with 2 lookups).
    On any miss, the hash entry to be updated is easy to find for software.
    But: there's effectively no associativity--but just increase the hash table's size if software detects that you are getting conflicts. Pick a good
    hash function to avoid stupid conflicts. I recommend you make the hash function an instruction, so software can calculate it easily (so if you reverse bits and do other stuff that may be hard for software to calculate, you make it easy).

    This gives you the benefits of a hardware table walker with less trouble.
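    A sketch of that per-process, single-probe scheme (the hash mixing
    function here is purely illustrative; the point of exposing the hash as
    an instruction is that software can compute exactly what hardware does):

```python
# Per-process hashed page table with a single probe: the table is sized to
# >= 2x the process's resident pages; index = hash(ASID, VPN) mod size.
# A miss traps to software, which knows exactly which slot to update.
MASK64 = (1 << 64) - 1

def hash_fn(asid, vpn):
    # Illustrative mixing function (xor-shift plus a multiply).
    x = ((asid << 52) ^ vpn) & MASK64
    x ^= x >> 17
    return (x * 0x9E3779B97F4A7C15) & MASK64

def probe(table, asid, vpn):
    slot = table[hash_fn(asid, vpn) % len(table)]
    if slot is not None and slot[0] == (asid, vpn):
        return slot[1]        # hit: the PPN
    return None               # single-probe miss: trap to software

def insert(table, asid, vpn, ppn):
    table[hash_fn(asid, vpn) % len(table)] = ((asid, vpn), ppn)
```

    On a conflict, software's recourse is the one Kent suggests: grow the
    table and rehash, rather than probe further in hardware.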

    An advantage of hashed page tables is the TLB itself has no "intermediate" entries. This lets the primary TLB hold more useful translations. This
    is not much of an issue if the primary TLB is >= 256 entries, but it's
    really noticeable if the TLB is <= 32 entries. You need to lean in to hashed page tables advantages.

    Kent

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Feb 8 19:28:51 2026
    From Newsgroup: comp.arch


    EricP <ThatWouldBeTelling@thevillage.com> posted:

    MitchAlsup wrote:
    EricP <ThatWouldBeTelling@thevillage.com> posted:

    EricP wrote:
    Robert Finch wrote:
    ??? I think a hash table has this characteristic. Multiple VA pages
    can point to a single physical page using a hash table. Is it just a
    performance issue? ASID is part of the hash/compare.
    Be careful about ASID's - they aren't just part of the virtual address
    in MMU's like x86. There is a logic aspect in the address compare.
    If the PTE Global flag is set then ASID is NOT included in the VA compare;
    if the PTE Global flag is clear then ASID is included in the VA compare.

    If the G flag is set on OS kernel pages then the effect is to make the
    kernel virtual pages one single address space on all cores with no
    associated ASID, while process virtual spaces reside in their own ASID's.

    The above is easy to implement in a CAM because the Global flag is
    part of the PTE entry. Compare the virtual page number (VPN) and if
    there is a PTE match, then if the PTE.G flag is set take ASID as matched,
    otherwise check if ASID is matched.

    But if you just tack the ASID onto the VA to get a larger VA and
    use that as an index then you don't get the same logic because
    the PTE.G flag that tells you whether ASID matters for that PTE,
    is on the PTE that you are looking up.
    One could achieve the same effect as the CAM approach by picking off the
    range of addresses assigned to the OS and assigning them to ASID = 0,
    and having the process range use the current non-zero ASID register.

    If we have a 64-bit VA with 4kB pages = 52-bit virtual page number
    and we added a 12-bit ASID then that gives a 76-bit virtual space
    with a 64-bit AS-VPN to look up. We reserve ASID=0 for the OS,
    1..4095 are for processes.

    SPARC V8 had 13-bit ASIDs and found that this was way too short for
    the market SUN was targeting. SPARC V9 went with 16-bit ASIDs and
    I think this is as SHORT as an ASID can be for "reasonable" future proofing.

    It also depends on whether ASIDs are assigned globally for a whole system
    or locally for each core. Local assignment for each core needs less bits.
    For globally assigned ASID 16 bits sounds too small for today -
    they recycle too often.

    When ASIDs were first introduced on x86 I remember a certain amount
    of concern for the effects they would have on the fully assoc TLB.
    The 32-bit x86 has a 20-bit Virtual Page Number so adding a 12-bit ASID
    would increase the TLB size and power usage by 60%.

    TLB CAM size did grow by 60%. TLB data store by next to nothing.

    Power grew by a trifling amount, because the ASID is constant between context switches.
    That is: the wires into TLB corresponding to ASID change once and remain constant until changed again, so no dynamic power--and the CAM cells can
    remain matched/unMatched for the duration. So ASID only consumes TLB power
    on context switch and TLB PTE install. Under normal conditions this is not
    that much power.

    Back then I spent some time looking at how to do local 4-bit ASIDs per-core. That is, each core has its own 4-bit ASID counter and each process
    remembers which ASID it used last on each core.
    Each core's TLB retains entries for up to the last 16 processes to run threads on that core, and each core recycles its ASID independently.
    The 4-bit size was chosen based on the assumption that the odds are small
    of a TLB containing useful entries beyond the last 16 processes to run.

    My guess is that all OS threads {interrupts, exceptions, OS threads} used
    the same ASID (likely 0000), so you only really have 15 application ASIDs.

    This eliminates many of the global ASID shoot-downs, but 4 bits is too small as it would recycle too quickly, causing too frequent local ASID invalidates.

    Especially when one has HyperVisor, Host OS, Guest OS ... with each Guest OS thinking that it has its own set of ASIDs.

    In my system, there are 4 sets of 16-bit ASIDs active at all times: an ASID
    for each of {HV, HOS, GOS, App}.

    Combining all the ideas: a 16+ bit ASID, locally assigned per core,
    with a batch TLB ASID invalidate mechanism using a bit mask for matching ASIDs, so I can invalidate all ids in just one of the groups I mentioned elsewhere: GGxx_xxxx_xxxx_xxxx, where GG is the 2-bit group number.


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Paul Clayton@paaronclayton@gmail.com to comp.arch on Sun Feb 8 15:26:09 2026
    From Newsgroup: comp.arch

    On 2/3/26 1:03 AM, Robert Finch wrote:
    After getting bogged down testing the hash table, I decided to
    shelve it for now. And...

    I re-wrote Qupls4 TLB for hierarchical page tables. It is much
    cleaner code now. The TLB is taking up only about 600 LUTs and 8
    BRAMs now. It is also simpler, being only a single-level TLB.
    Three copies of the TLB would be needed.

    Thinking about adding caching of translation steps to avoid main
    memory access.

    If you use huge pages where page table nodes can be larger
    pages, then you might consider merging the page directory entry
    and huge PTE cache. This wastes N bits (10 bits for 32-bit PTEs
    and 4 KiB pages) for the PDE unless huge pages use these bits
    for additional TLB-managed features (or PDEs are constrained to
    an N-bit smaller address space, which is awkward). However, such
    allows entries to be used for both kinds of data (since both
    have the same index:tag bits). (If I recall correctly, x86
    allowed 4 MiB pages to use a 42-bit address space by using the
    bits that would otherwise be zero due to 4 MiB huge page
    alignment.) (I think this sharing works better for a modest-sized
    L2 huge page TLB/PDE cache.)

    This sharing saves a few tag bits compared with hash/rehash TLB
    for multiple page sizes at the cost of fixed large region
    caching size (though a hash/rehash TLB could repurpose the 2N
    bits - N from the tag and N from the translation - for huge
    pages).

    In theory, with rehashing the reduced tag entries could be used
    for base-sized pages that can be sign extended for the missing
    bits. Such might be useful as a victim cache, but I suspect such
    a feature would not be worthwhile.

    With PDE caching, it also becomes possible to use André Seznec's
    "Don't use the page number but a pointer to it" for TLB tags to
    compress the tags. (The PDE cache already holds the top virtual
    address bits [and any ASID], so the small page TLB entries can
    use a pointer to the PDE cache entry to compress the TLB entry.
    Unlike Seznec's proposal, this requires reading an extra table,
    though if the PDE cache used for this compression is also a
    large page TLB then parallel access might be desired anyway.)

    Replacement decisions might be more complicated with support for
    different entry types.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Paul Clayton@paaronclayton@gmail.com to comp.arch on Sun Feb 8 21:28:14 2026
    From Newsgroup: comp.arch

    On 2/8/26 5:57 PM, BGB wrote:
    On 2/5/2026 11:23 PM, Kent Dickey wrote:
    In article <1770238707-5857@newsgrouper.org>,
    [snip]
    I think the 32-entry fully associative is the primary TLB.
    The 1024 entry is a secondary TLB. Hits in the primary TLB
    cause no stalls at all. Misses in the primary probably lead
    to stalls (but it is possible the 6 cycle latency can be
    hidden sometimes), so it costs performance.

    If the primary TLB (the one with no stalls) is only 4 entries,
    then we can easily calculate a pretty substantial stall penalty
    for most programs since it will pay the 6 cycle stall a lot
    (and that's for a L2 TLB hit). So we make the L1 TLB as big as
    possible, but it cannot be made too large or it's too slow, so
    it ends up between 32 and 128 entries (256 tops).


    Full associativity isn't worth it.

    The gains from more associativity seem to max out at around
    8-way, and 32 pages isn't nearly enough coverage (very high TLB
    miss rates would be unavoidable).

    Full associativity is not just about conflict misses. Such also
    facilitates supporting multiple page sizes in the same TLB.
    Different page sizes naturally use different indexing bits, so
    having zero indexing bits means that any page can occupy any entry.

    André Seznec proposed a way to use (overlaid) skewed
    associativity to support multiple page sizes ("Concurrent
    Support of Multiple Page Sizes On a Skewed Associative TLB",
    2003). This has the limitation of requiring shared indexing bits
    so that the different page size probes (done in parallel) can be
    mapped to different banks. This also requires accessing more
    entries with limited associativity (the skewed associative
    conflict avoidance reduces the miss rate that this would otherwise
    cause as well as balances utilization among page sizes).

    AMD used full associativity to support coalescing of 4KiB pages
    into 16KiB (32KiB in Zen+ and earlier) TLB entries. (An ordinary
    8-way associative TLB indexed for 16KiB "pages" would map all
    uncoalescable 4KiB pages in a 16KiB chunk to the same set,
    increasing conflict misses.) It might have been possible to
    index by the shared bits of coalesced and base page sizes and
    use a CAM-based predictor to choose whether to index the rest by
    base page or by coalesced page, but I suspect such would not be
    that much smaller than using full associativity, and likely
    slower and more power consuming, as well as vulnerable to false
    conflicts from the predictor limiting which PTEs can share a set.
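    A sketch of the coalescing condition (illustrative only; AMD's actual
    detection logic is not public as far as I know):

```python
# Four aligned, contiguous 4 KiB translations can coalesce into one
# 16 KiB TLB entry when virtual and physical pages line up: the VPNs
# form an aligned group of 4 and each maps to the matching PPN offset.
def coalescable(mappings, base_vpn):
    # mappings: dict VPN -> PPN; base_vpn must be 16 KiB aligned.
    if base_vpn % 4 != 0:
        return False
    base_ppn = mappings.get(base_vpn)
    if base_ppn is None or base_ppn % 4 != 0:
        return False
    return all(mappings.get(base_vpn + i) == base_ppn + i for i in range(4))
```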

    Full associativity may also have been important for Itanium 2's
    prevalidated cache, where each cache tag held a one-hot bit
    vector indicating which TLB entry applied to it. This allowed faster
    tag check and flash invalidation of (write-through) cache
    entries when a TLB entry was evicted. The number of bits in the
    cache tag limits the TLB capacity, which makes associativity
    more important, and TLB misses are more expensive since they
    might invalidate useful cache blocks. (Write through may have
    been desirable for reliability or been a consequence of seeking
    a low latency mechanism for L1 accesses.)

    Of course, the tradeoffs are different for an FPGA implementation.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Paul Clayton@paaronclayton@gmail.com to comp.arch on Sun Feb 8 21:55:29 2026
    From Newsgroup: comp.arch

    On 2/6/26 9:29 AM, EricP wrote:
    Anton Ertl wrote:
    Robert Finch <robfi680@gmail.com> writes:
    On 2026-02-01 12:49 p.m., EricP wrote:
    Simpler to have create-process assign a new ASID by incrementing
    an unsigned up counter that starts at 1. When you hit its max then
    reset to 1 and clear the whole IVT for all non-zero ASID entries
    in one pass.
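    A sketch of that simple scheme (names illustrative; the flush stands in
    for clearing all non-zero ASID entries from the IVT/TLB):

```python
# Simple ASID allocation: a wrapping counter starting at 1; ASID 0 is
# reserved for the OS. On hitting the max, all non-zero ASID mappings are
# flushed in one pass and processes re-fault their translations in.
ASID_MAX = (1 << 16) - 1

class SimpleAsid:
    def __init__(self):
        self.counter = 0
        self.flushes = 0

    def alloc(self, tlb):
        if self.counter >= ASID_MAX:
            self.counter = 0
            tlb.clear()        # wipe all non-zero ASID entries
            self.flushes += 1
        self.counter += 1
        return self.counter
```

    With enough ASID bits the flush is rare, which is the point Anton makes
    below.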


    A garbage collection approach may work for removed entries.
    It might be a complicated state machine though to do in HW.
    It could look for invalid entries or ones with defunct ASIDs.

    AFAIK the approach outlined by EricP has been used in commercial
    hardware. It's simple, and if there are enough ASIDs, cheap (because
    resets are rare). So better add more bits to the ASIDs than
    introduce complicated machinery for reclaiming ASIDs.

    - anton

    The method I described before was a bit brute force, as it looks
    like
    it has to stall the whole system while it erases the old ASIDs.
    It is inefficient as it erases all ASIDs including ones recently
    assigned.

    One issue with invalidating ASIDs is that while the L1-TLB may be
    fully associative, L2-TLBs are often set associative.
    While a full assoc TLB can invalidate multiple matching entries in
    1 clock, the set assoc TLB requires a scan doing a read-compare-write
    operation on all the ways in each set, taking 2 or more clocks per set.
    So an L2-TLB with 512 sets could take 1024 clocks or more to complete
    an ASID shoot-down.

    This would also cause all running processes to have a flurry of
    TLB misses
    as they reload the entries just purged.

    I would want something smoother and continuous, that usually doesn't
    need to stall the whole system, and only recycles the oldest ASIDs.

    Something like: split the ASIDs into 4 groups using the 2 high bits.
    As the ASID counter increments from group N to group N+1, it
    initiates an invalidate of group N+2, such that it retains the
    recent ASIDs in groups N and N-1. And this is circular, so when it
    starts assigning in G3 it initiates the invalidate of G0. The
    invalidate is also asynchronous, so the system can continue to run
    and assign ASIDs while it invalidates the next group ahead (there
    are some interlocks waiting for ACKs).

    e.g. Say the ASID counter is 7 bits, so 4 groups of 32 IDs,
    called G0..G3.
    Say the current assignment ASID counter = 63, at the top of G1.
    When it increments to 64, the first ID in G2, it begins
    invalidating all group G3 ids in the range 96..127. Any processes
    currently using ASIDs in the G3 range have their ASID invalidated,
    and when they next run they will be assigned a new id in the
    current assignment group. All processes in groups G0 and G1 are
    unaffected by this ASID recycling. The core broadcasts an IPI
    (interprocessor interrupt) to all other cores to shoot down all
    group G3 ASIDs.

    One could also use a one-hot bit encoding with shared write
    ports (flash invalidation). Four groups would add two bits of
    overhead (plus whatever overhead is associated with the shared
    write port).

    I suspect pre-scrubbing via probing would be preferred.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Mon Feb 9 00:11:45 2026
    From Newsgroup: comp.arch

    On 2026-02-08 9:28 p.m., Paul Clayton wrote:
    On 2/8/26 5:57 PM, BGB wrote:
    On 2/5/2026 11:23 PM, Kent Dickey wrote:
    In article <1770238707-5857@newsgrouper.org>,
    [snip]
    I think the 32-entry fully associative is the primary TLB. The 1024
    entry is a secondary TLB. Hits in the primary TLB cause no stalls at
    all. Misses in the primary probably lead to stalls (but it is
    possible the 6 cycle latency can be hidden sometimes), so it costs
    performance.

    If the primary TLB (the one with no stalls) is only 4 entries, then
    we can easily calculate a pretty substantial stall penalty for most
    programs since it will pay the 6 cycle stall a lot (and that's for a
    L2 TLB hit). So we make the L1 TLB as big as possible, but it cannot
    be made too large or it's too slow, so it ends up between 32 and 128
    entries (256 tops).


    Full associativity isn't worth it.

    The gains from more associativity seem to max out at around 8-way, and
    32 pages isn't nearly enough coverage (very high TLB miss rates would
    be unavoidable).

    Full associativity is not just about conflict misses. Such also
    facilitates supporting multiple page sizes in the same TLB. Different
    page sizes naturally use different indexing bits, so having zero
    indexing bits means that any page can occupy any entry.

    André Seznec proposed a way to use (overlaid) skewed associativity
    to support multiple page sizes ("Concurrent Support of Multiple
    Page Sizes On a Skewed Associative TLB", 2003). This has the
    limitation of requiring shared indexing bits so that the different
    page size probes (done in parallel) can be mapped to different
    banks. It also requires accessing more entries with limited
    associativity (the skewed associative conflict avoidance reduces
    the miss rate that this would otherwise cause, as well as balancing
    utilization among page sizes).
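
    A rough sketch of the idea (the per-way hash functions here are
    invented for illustration and are not Seznec's actual skewing
    functions): each way indexes with a different hash of the bits above
    the largest supported page offset, so those bits are shared by every
    overlaid page size.

```python
# Toy skewed-associative TLB supporting overlaid page sizes. Each way
# uses a different index hash, so addresses that collide in one way
# usually do not collide in another. Index bits are taken above the
# largest page offset so one probe works for every page size.

SETS = 64                  # sets per way
WAYS = 4
LARGEST_PAGE_BITS = 16     # 64 KiB: index bits sit above this offset

def way_index(vaddr, way):
    x = vaddr >> LARGEST_PAGE_BITS        # bits shared by all page sizes
    return ((x >> (way * 3)) ^ (x >> (5 + way))) % SETS  # illustrative skew

class SkewedTLB:
    def __init__(self):
        self.ways = [dict() for _ in range(WAYS)]  # set -> (vpn, size, ppn)

    def lookup(self, vaddr):
        for w in range(WAYS):
            e = self.ways[w].get(way_index(vaddr, w))
            if e is not None:
                vpn, size_bits, ppn = e
                if (vaddr >> size_bits) == vpn:    # tag match at entry's size
                    return ppn
        return None

    def insert(self, vaddr, size_bits, ppn, way):
        self.ways[way][way_index(vaddr, way)] = (vaddr >> size_bits,
                                                 size_bits, ppn)
```
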

    The article requires signing in with an IEEE membership. I used to
    have a student membership many years ago. I let it expire as I was
    not using it; this was before the internet was big. Every so often
    I contemplate getting one, but I am trying to keep costs down.

    I have gotten a pretty good imagination for how things work just by
    reading headlines or summaries.

    I have gone ahead and made a TLB that is only one level, though, to
    conserve resources. The TLB is four-way associative with 512
    entries (making it smaller does not reduce resource usage). I found
    the BRAM TLB could be made fast enough to support the CPU with
    single-cycle translations, which means having a faster L1 was not a
    benefit.

    AMD used full associativity to support coalescing of 4KiB pages
    into 16KiB (32KiB in Zen+ and earlier) TLB entries. (An ordinary
    8-way associative TLB indexed for 16KiB "pages" would map all
    uncoalescable 4KiB pages in a 16KiB chunk to the same set,
    increasing conflict misses.) It might have been possible to index
    by the shared bits of coalesced and base page sizes and use a
    CAM-based predictor to choose whether to index the rest by base
    page or by coalesced page, but I suspect such would not be that
    much smaller than using full associativity, and likely slower and
    more power-consuming, as well as vulnerable to false conflicts from
    the predictor limiting which PTEs can share a set.

    Full associativity may also have been important for Itanium 2's
    prevalidated cache, where each cache tag had a one-hot bit
    indicating which TLB entry applied to it. This allowed faster tag
    checks and flash invalidation of (write-through) cache entries when
    a TLB entry was evicted. The number of bits in the cache tag limits
    the TLB capacity, which makes associativity more important, and TLB
    misses are more expensive since they might invalidate useful cache
    blocks. (Write-through may have been desirable for reliability or
    been a consequence of seeking a low-latency mechanism for L1
    accesses.)

    Of course, the tradeoffs are different for an FPGA implementation.

    Yeah, CAMs are expensive in FPGAs. They are often two-cycle, too.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Mon Feb 9 15:44:26 2026
    From Newsgroup: comp.arch

    Robert Finch wrote:
    On 2026-02-08 9:28 p.m., Paul Clayton wrote:

    André Seznec proposed a way to use (overlaid) skewed associativity
    to support multiple page sizes ("Concurrent Support of Multiple
    Page Sizes On a Skewed Associative TLB", 2003). This has the
    limitation of requiring shared indexing bits so that the different
    page size probes (done in parallel) can be mapped to different
    banks. It also requires accessing more entries with limited
    associativity (the skewed associative conflict avoidance reduces
    the miss rate that this would otherwise cause, as well as balancing
    utilization among page sizes).

    The article requires signing in with an IEEE membership. I used to have
    a student membership many years ago. I let it expire as I was not using
    it, this was before the internet was big. Every so often I contemplate getting one, but I am trying to keep costs down.

    A bit of poking about finds his home page

    https://team.inria.fr/pacap/members/andre-seznec/

    which links to

    https://team.inria.fr/pacap/members/andre-seznec/cache-architecture-research-2/

    and to his paper

    Concurrent Support of Multiple Page Sizes on a Skewed Associative TLB, 2003 http://www.irisa.fr/caps/people/seznec/SKEWEDTLBperso.pdf


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Mon Feb 9 16:27:15 2026
    From Newsgroup: comp.arch

    On 2/9/2026 2:44 PM, EricP wrote:
    Robert Finch wrote:
    On 2026-02-08 9:28 p.m., Paul Clayton wrote:

    André Seznec proposed a way to use (overlaid) skewed associativity
    to support multiple page sizes ("Concurrent Support of Multiple
    Page Sizes On a Skewed Associative TLB", 2003). This has the
    limitation of requiring shared indexing bits so that the different
    page size probes (done in parallel) can be mapped to different
    banks. It also requires accessing more entries with limited
    associativity (the skewed associative conflict avoidance reduces
    the miss rate that this would otherwise cause, as well as balancing
    utilization among page sizes).
    The article requires signing in with an IEEE membership. I used to
    have a student membership many years ago. I let it expire as I was not
    using it, this was before the internet was big. Every so often I
    contemplate getting one, but I am trying to keep costs down.

    A bit of poking about finds his home page

    https://team.inria.fr/pacap/members/andre-seznec/

    which links to

    https://team.inria.fr/pacap/members/andre-seznec/cache-architecture- research-2/

    and to his paper

    Concurrent Support of Multiple Page Sizes on a Skewed Associative TLB, 2003 http://www.irisa.fr/caps/people/seznec/SKEWEDTLBperso.pdf


    My usual thing here was:
    In mixed page-size configurations, the MMU is set to the smallest in-use
    page size;
    When set to larger page sizes, a few of the bits that were low-order
    bits with a smaller page size change to being high-order bits with a
    larger page size (all other bits remaining in the same place).


    So, say:
    If using both 4K and 16K pages, MMU is set for 4K;
    But, can set it for 16K if no 4K pages are being used;
    Or, 64K if not 4K or 16K pages are being used.

    With normal page tables, there is a defined "basic" page size and any
    other page sizes would exist as some multiple of this size. There is a
    slight inefficiency when dealing with larger pages with a smaller basic
    page size, but this was a reasonable tradeoff.
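
    The offset/VPN split at each step can be illustrated in a couple of
    lines (a minimal sketch, not BGB's code): each step up in page size
    moves bits between the page offset and the VPN, while every other
    bit keeps its position and meaning.

```python
# Illustration of the offset/VPN split at 4K, 16K, and 64K page sizes:
# only the bits at the boundary change roles; all other bits stay put.

def split(vaddr, page_bits):
    """Return (vpn, offset) for a page of 2**page_bits bytes."""
    return vaddr >> page_bits, vaddr & ((1 << page_bits) - 1)

addr = 0x0012_3456
vpn4k,  off4k  = split(addr, 12)   # 4 KiB pages
vpn16k, off16k = split(addr, 14)   # 16 KiB pages
vpn64k, off64k = split(addr, 16)   # 64 KiB pages
```
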

    If the page size changes between ASIDs, this isn't a huge issue as
    normally ASIDs can't see each other's pages anyways, except for global
    pages, but if using page-tables then any global pages shared between
    page tables would implicitly need the same page size.

    Seemed "obvious enough".



    Paper seems to have the idea of subdividing associativity if multiple
    page sizes are in use at the same time. I guess this works, if one has
    enough associativity.

    In this case 4 active page sizes would require at least an 8-way TLB to
    avoid "shooting oneself in the foot" though. Here, 2 active page sizes
    could be used with a 4-way TLB, where at least 2-way is needed (assuming modulo addressing) to maintain stability.

    This also works I guess.


    Ironically, a similar downside existed in my case when using Addr96
    modes, where this also came at the cost of reducing TLB associativity
    (though, this cost can be reduced when using primarily 48-bit
    addressing, *).

    *: Say, for example, if pages are either part of:
    The "null quadrant" (low 48 bit range of the 96-bit VAS);
    Part of the "current quadrant" (as defined by GBH);
    Normal 48-bit VAS memory ops are relative to GBH;
    Or, PCH for I$, but nominally PCH==GBH.
    Except maybe if you want to do a Harvard Addressing thing.
    In this case, can potentially key the high 48 bits to the ASID, so
    things still work so long as GBH is assumed to not change within a given
    ASID, and the page-miss handler can upload pages in the cheaper 48-bit
    VAS format rather than 96-bit VAS format (and sidestep the associativity cost).

    But, have mostly ended up not using the 96-bit VAS stuff, as it is kinda overkill.

    Even if, yeah, do have a chunk of encoding space being eaten by
    Load/Store ops that are mostly being unused and exist primarily to deal
    with 128-bit pointers. In this case would still need a 96-bit TLBE
    whenever accessing a page in some other part of the VAS.

    ...


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Mon Feb 9 20:56:34 2026
    From Newsgroup: comp.arch

    On 2026-02-09 5:27 p.m., BGB wrote:
    On 2/9/2026 2:44 PM, EricP wrote:
    Robert Finch wrote:
    On 2026-02-08 9:28 p.m., Paul Clayton wrote:

    André Seznec proposed a way to use (overlaid) skewed associativity
    to support multiple page sizes ("Concurrent Support of Multiple
    Page Sizes On a Skewed Associative TLB", 2003). This has the
    limitation of requiring shared indexing bits so that the different
    page size probes (done in parallel) can be mapped to different
    banks. It also requires accessing more entries with limited
    associativity (the skewed associative conflict avoidance reduces
    the miss rate that this would otherwise cause, as well as balancing
    utilization among page sizes).

    The article requires signing in with an IEEE membership. I used to
    have a student membership many years ago. I let it expire as I was
    not using it, this was before the internet was big. Every so often I
    contemplate getting one, but I am trying to keep costs down.

    A bit of poking about finds his home page

    https://team.inria.fr/pacap/members/andre-seznec/

    which links to

    https://team.inria.fr/pacap/members/andre-seznec/cache-architecture-
    research-2/

    and to his paper

    Concurrent Support of Multiple Page Sizes on a Skewed Associative TLB,
    2003
    http://www.irisa.fr/caps/people/seznec/SKEWEDTLBperso.pdf


    My usual thing here was:
    In mixed page-size configurations, the MMU is set to the smallest in-use page size;
    When set to larger page sizes, a few of the bits that were low-order
    bits with a smaller page size change to being high-order bits with a
    larger page size (all other bits remaining in the same place).


    So, say:
    If using both 4K and 16K pages, MMU is set for 4K;
    But, can set it for 16K if no 4K pages are being used;
    Or, 64K if not 4K or 16K pages are being used.

    With normal page tables, there is a defined "basic" page size and any
    other page sizes would exist as some multiple of this size. There is a slight inefficiency when dealing with larger pages with a smaller basic
    page size, but this was a reasonable tradeoff.

    If the page size changes between ASIDs, this isn't a huge issue as
    normally ASIDs can't see each other's pages anyways, except for global pages, but if using page-tables then any global pages shared between
    page tables would implicitly need the same page size.

    Seemed "obvious enough".



    Paper seems to have the idea of subdividing associativity if multiple
    page sizes are in use at the same time. I guess this works, if one has enough associativity.

    In this case 4 active page sizes would require at least an 8-way TLB to avoid "shooting oneself in the foot" though. Here, 2 active page sizes
    could be used with a 4-way TLB, where at least 2-way is needed (assuming modulo addressing) to maintain stability.

    This also works I guess.


    Ironically, a similar downside existed in my case when using Addr96
    modes, where this also came at the cost of reducing TLB associativity (though, this cost can be reduced when using primarily 48-bit
    addressing, *).

    *: Say, for example, if pages are either part of:
    The "null quadrant" (low 48 bit range of the 96-bit VAS);
    Part of the "current quadrant" (as defined by GBH);
    Normal 48-bit VAS memory ops are relative to GBH;
    Or, PCH for I$, but nominally PCH==GBH.
    Except maybe if you want to do a Harvard Addressing thing.
    In this case, can potentially key the high 48 bits to the ASID, so
    things still work so long as GBH is assumed to not change within a given ASID, and the page-miss handler can upload pages in the cheaper 48-bit
    VAS format rather than 96-bit VAS format (and sidestep the associativity cost).

    But, have mostly ended up not using the 96-bit VAS stuff, as it is kinda overkill.

    Even if, yeah, do have a chunk of encoding space being eaten by Load/
    Store ops that are mostly being unused and exist primarily to deal with 128-bit pointers. In this case would still need a 96-bit TLBE whenever accessing a page in some other part of the VAS.

    ...



    I found a slide presentation on the idea.

    I am wondering how an LRU system would work with the skewed entries.
    I assume the entries would be shifting between the ways for LRU.
    So, if there is a different-size page in one of the ways, would it
    still work?




    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Paul Clayton@paaronclayton@gmail.com to comp.arch on Mon Feb 9 21:26:40 2026
    From Newsgroup: comp.arch

    On 2/9/26 12:11 AM, Robert Finch wrote:
    On 2026-02-08 9:28 p.m., Paul Clayton wrote:
    [snip]
    André Seznec proposed a way to use (overlaid) skewed
    associativity to support multiple page sizes ("Concurrent
    Support of Multiple Page Sizes On a Skewed Associative TLB",
    2003).
    [snip]

    The article requires signing in with an IEEE membership. I used
    to have a student membership many years ago. I let it expire as
    I was not using it, this was before the internet was big. Every
    so often I contemplate getting one, but I am trying to keep
    costs down.

    Duck Duck Go provided the following url: https://www.irisa.fr/caps/people/seznec/SKEWEDTLBperso.pdf
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Feb 10 19:04:40 2026
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2026-02-09 5:27 p.m., BGB wrote:
    On 2/9/2026 2:44 PM, EricP wrote:
    ------------------------------------

    I found a slide presentation on the idea.

    I am wondering how a LRU system would work with the skewed entries.
    I assume the entries would be shifting between the ways for LRU.
    So, if there is a different-size page in one of the ways, would it
    still work?

    One would use the not-recently-used variant of LRU. One accumulates
    used bits on a per-set basis. When all used bits have been set, one
    clears all used bits.

    In practice it is easier to implement and almost as good.
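
    A small software model of this NRU policy (the structure and names
    are invented for illustration):

```python
# Model of the NRU scheme above: one used bit per entry in a set. When
# every bit becomes set (the AND-gate condition), all bits clear at
# once -- in this simplest variant, including the most recently used
# entry's bit. The victim on a miss is the first not-recently-used
# entry found.

class NRUSet:
    def __init__(self, ways):
        self.used = [False] * ways

    def touch(self, way):
        """Mark an entry recently used on a hit or fill."""
        self.used[way] = True
        if all(self.used):                   # the AND gate asserts
            self.used = [False] * len(self.used)

    def victim(self):
        """Find-first entry whose used bit is clear."""
        return self.used.index(False)
```
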
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Feb 10 14:17:27 2026
    From Newsgroup: comp.arch

    On 2/10/2026 1:04 PM, MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-02-09 5:27 p.m., BGB wrote:
    On 2/9/2026 2:44 PM, EricP wrote:
    ------------------------------------

    I found a slide presentation on the idea.

    I am wondering how a LRU system would work with the skewed entries.
    I assume the entries would be shifting between the ways for LRU. So, if
    there is a different size page in one of the ways would it still work?

    One would use the not-recently-used variant of LRU.
    One accumulated used bits on a per set basis.
    When all used-bits have been set, one clears
    all used bits.

    In practice it is easier to implement and almost as good.

    FWIW:
    I ended up not using an MRU or LRU for the TLB.

    While there is a loss of efficiency here (it always evicts the
    oldest page rather than the least-used), it does mean that the TLB
    miss handler can model the TLB deterministically (if the handler
    knows the length and width, it can be modeled exactly).

    Potentially a clever TLB miss handler could compensate by re-injecting
    the page that fell off the end if it is very likely to trigger another
    TLB miss (possibly tracking commonly-missed pages in an STF MRU or
    similar in software, and re-injecting pages that appear within this MRU
    at a higher rank than whatever is in the current last place).
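
    The determinism point can be sketched as follows (names and
    structure are illustrative): with FIFO replacement the miss handler
    can keep a shadow copy of a TLB set and predict every eviction
    without ever reading the TLB back.

```python
# Sketch: with FIFO (oldest-first) replacement, a software shadow copy
# of one TLB set stays exactly in sync with the hardware, so the miss
# handler always knows which entry falls off the end next.

from collections import deque

class FifoTLBSet:
    """One set of an N-way TLB with oldest-first replacement."""
    def __init__(self, ways):
        self.entries = deque(maxlen=ways)   # leftmost entry is oldest

    def fill(self, vpn, ppn):
        self.entries.append((vpn, ppn))     # deque drops the oldest itself

hw = FifoTLBSet(4)       # the hardware set
shadow = FifoTLBSet(4)   # the miss handler's software model of it

for vpn in range(6):     # six misses into a four-way set
    hw.fill(vpn, vpn + 100)
    shadow.fill(vpn, vpn + 100)
```
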


    Another theoretical possibility would be to allow the CPU to access
    the contents of the TLB via MMIO, but there are strong reasons not
    to do this. So, naive FIFO ended up being preferable. The other
    possibility being to have LRU, but then needing to fully evict a
    given set whenever a page's attributes are modified (such as a page
    swapped in or out of the pagefile).

    It is tradeoffs here...


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Feb 10 21:56:59 2026
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 2/10/2026 1:04 PM, MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-02-09 5:27 p.m., BGB wrote:
    On 2/9/2026 2:44 PM, EricP wrote:
    ------------------------------------

    I found a slide presentation on the idea.

    I am wondering how a LRU system would work with the skewed entries.
    I assume the entries would be shifting between the ways for LRU. So, if
    there is a different size page in one of the ways would it still work?

    One would use the not-recently-used variant of LRU.
    One accumulated used bits on a per set basis.
    When all used-bits have been set, one clears
    all used bits.

    In practice it is easier to implement and almost as good.

    FWIW:
    I ended up not using an MRU or LRU for the TLB.

    My NRU comprises a single S-R flip-flop per PTE, with all of them
    wired to a (high-fan-in) AND gate.

    When all NRU bits are set, the AND gate asserts, and all reset
    synchronously.

    When the TLB takes a miss, find-first logic chooses a unary index
    for the new PTE. Writing (or reading) a PTE sets its S-R FF.

    Simple and effective.

    While there is a loss of efficiency here (it always evicts the oldest
    page rather than the least-used), it does mean that the TLB miss handler
    can model the TLB deterministically (if the handler knows the length and width; it can be modeled exactly).

    Potentially a clever TLB miss handler could compensate by re-injecting
    the page that fell off the end if it is very likely to trigger another
    TLB miss (possibly tracking commonly-missed pages in an STF MRU or
    similar in software, and re-injecting pages that appear within this MRU
    at a higher rank than whatever is in the current last place).


    Another theoretical possibility would be to allow the CPU to access the contents of the TLB via MMIO, but there are strong reasons to not do
    this.

    MMI/O is potentially SLOW.

    So, naive FIFO ended up being preferable. The other possibility
    being to have LRU, but then needing to fully evict a given set whenever
    a pages' attributes are modified (such as a page swapped in or out of
    the pagefile).

    True LRU is BigO( n^3 ) in logic gates; whereas NRU is BigO( k ).


    It is tradeoffs here...

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue Feb 10 14:39:07 2026
    From Newsgroup: comp.arch

    On 2/10/2026 1:56 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 2/10/2026 1:04 PM, MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-02-09 5:27 p.m., BGB wrote:
    On 2/9/2026 2:44 PM, EricP wrote:
    ------------------------------------

    I found a slide presentation on the idea.

    I am wondering how a LRU system would work with the skewed entries.
    I assume the entries would be shifting between the ways for LRU. So, if there is a different size page in one of the ways would it still work?

    One would use the not-recently-used variant of LRU.
    One accumulated used bits on a per set basis.
    When all used-bits have been set, one clears
    all used bits.

    In practice it is easier to implement and almost as good.

    FWIW:
    I ended up not using an MRU or LRU for the TLB.

    My NRU is comprised of a single S-R flip-flop per PTE, with all of them
    wired to an (high-fan-in) AND gate.

    When all NRU-bits are set, AND gate asserts, and all reset synchronously.

    When the TLB takes a miss, find-first logic chooses a unary index for the new PTE. Writing (or reading) a PTE sets its S-R FF.

    Simple and effective.

    I presume NRU means "not recently used". But if, when all the NRU
    bits are set, all are reset, including the MRU entry's bit, might
    it get overwritten?
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Feb 11 01:42:01 2026
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 2/10/2026 1:56 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 2/10/2026 1:04 PM, MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-02-09 5:27 p.m., BGB wrote:
    On 2/9/2026 2:44 PM, EricP wrote:
    ------------------------------------

    I found a slide presentation on the idea.

    I am wondering how a LRU system would work with the skewed entries.
    I assume the entries would be shifting between the ways for LRU. So, if there is a different size page in one of the ways would it still work?
    One would use the not-recently-used variant of LRU.
    One accumulated used bits on a per set basis.
    When all used-bits have been set, one clears
    all used bits.

    In practice it is easier to implement and almost as good.

    FWIW:
    I ended up not using an MRU or LRU for the TLB.

    My NRU is comprised of a single S-R flip-flop per PTE, with all of them wired to an (high-fan-in) AND gate.

    When all NRU-bits are set, AND gate asserts, and all reset synchronously.

    When the TLB takes a miss, find-first logic chooses a unary index for the new PTE. Writing (or reading) a PTE sets its S-R FF.

    Simple and effective.

    I presume NRU means "not recently used". But if when all the NRU bits
    are set, all are reset, including the MRU entry's bit, might it get overwritten?

    The logic above is the simplest that performs "reasonably well". If
    certain timing requirements can be met, the new entry is marked
    recently-used; if not, you lose a little in event performance but
    gain in pipeline frequency.



    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Feb 11 00:01:50 2026
    From Newsgroup: comp.arch

    On 2026-02-10 4:56 p.m., MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 2/10/2026 1:04 PM, MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-02-09 5:27 p.m., BGB wrote:
    On 2/9/2026 2:44 PM, EricP wrote:
    ------------------------------------

    I found a slide presentation on the idea.

    I am wondering how a LRU system would work with the skewed entries.
    I assume the entries would be shifting between the ways for LRU. So, if there is a different size page in one of the ways would it still work?

    One would use the not-recently-used variant of LRU.
    One accumulated used bits on a per set basis.
    When all used-bits have been set, one clears
    all used bits.

    In practice it is easier to implement and almost as good.

    FWIW:
    I ended up not using an MRU or LRU for the TLB.

    My NRU is comprised of a single S-R flip-flop per PTE, with all of them
    wired to an (high-fan-in) AND gate.

    When all NRU-bits are set, AND gate asserts, and all reset synchronously.

    When the TLB takes a miss, find-first logic chooses a unary index for the new PTE. Writing (or reading) a PTE sets its S-R FF.

    Simple and effective.

    While there is a loss of efficiency here (it always evicts the oldest
    page rather than the least-used), it does mean that the TLB miss handler
    can model the TLB deterministically (if the handler knows the length and
    width; it can be modeled exactly).

    Potentially a clever TLB miss handler could compensate by re-injecting
    the page that fell off the end if it is very likely to trigger another
    TLB miss (possibly tracking commonly-missed pages in an STF MRU or
    similar in software, and re-injecting pages that appear within this MRU
    at a higher rank than whatever is in the current last place).


    Another theoretical possibility would be to allow the CPU to access the
    contents of the TLB via MMIO, but there are strong reasons to not do
    this.

    MMI/O is potentially SLOW.

    So, naive FIFO ended up being preferable. The other possibility
    being to have LRU, but then needing to fully evict a given set whenever
    a pages' attributes are modified (such as a page swapped in or out of
    the pagefile).

    True LRU is BigO( n^3 ) in logic gates; whereas NRU is BigO( k ).

    I did not realize it was so many gates for LRU. I thought it was
    mainly latches. The LRU in use for Qupls has the ways connected as
    a shift register. New entries are always added at way three for a
    four-way TLB; way zero gets evicted.

    If it is a lower gate count for NRU, the TLB may be switched -
    another re-write.

    The Qupls TLB is accessible as MMIO but this is probably
    unnecessary. It is also updated by an SCP now instead of a state
    machine. The SCP also handles Amiga Copper-style audio/video
    co-processing.




    It is tradeoffs here...


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Feb 11 01:02:06 2026
    From Newsgroup: comp.arch

    On 2026-02-11 12:01 a.m., Robert Finch wrote:
    On 2026-02-10 4:56 p.m., MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 2/10/2026 1:04 PM, MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-02-09 5:27 p.m., BGB wrote:
    On 2/9/2026 2:44 PM, EricP wrote:
    ------------------------------------

    I found a slide presentation on the idea.

    I am wondering how a LRU system would work with the skewed entries.
    I assume the entries would be shifting between the ways for LRU.
    So, if
    there is a different size page in one of the ways would it still work?
    One would use the not-recently-used variant of LRU.
    One accumulated used bits on a per set basis.
    When all used-bits have been set, one clears
    all used bits.

    In practice it is easier to implement and almost as good.

    FWIW:
    I ended up not using an MRU or LRU for the TLB.

    My NRU is comprised of a single S-R flip-flop per PTE, with all of them
    wired to an (high-fan-in) AND gate.

    When all NRU-bits are set, AND gate asserts, and all reset synchronously.

    When the TLB takes a miss, find-first logic chooses a unary index for the new
    PTE. Writing (or reading) a PTE sets its S-R FF.

    Simple and effective.
    While there is a loss of efficiency here (it always evicts the
    oldest page rather than the least-used), it does mean that the TLB
    miss handler can model the TLB deterministically (if the handler
    knows the length and width; it can be modeled exactly).

    Potentially a clever TLB miss handler could compensate by re-injecting
    the page that fell off the end if it is very likely to trigger another
    TLB miss (possibly tracking commonly-missed pages in an STF MRU or
    similar in software, and re-injecting pages that appear within this MRU
    at a higher rank than whatever is in the current last place).


    Another theoretical possibility would be to allow the CPU to access the
    contents of the TLB via MMIO, but there are strong reasons to not do
    this.

    MMI/O is potentially SLOW.

    So, naive FIFO ended up being preferable. The other possibility
    being to have LRU, but then needing to fully evict a given set whenever
    a pages' attributes are modified (such as a page swapped in or out of
    the pagefile).

    True LRU is BigO( n^3 ) in logic gates; whereas NRU is BigO( k ).

    I did not realize it was so many gates for LRU. I thought it was
    mainly latches. The LRU in use for Qupls has the ways connected as
    a shift. Then new entries are always added at way three for a
    four-way TLB. Way zero gets evicted.

    If it is lower gate count for NRU the TLB may be switched - another re- write.

    The Qupls TLB is accessible as MMIO but this is probably unnecessary. It
    is also updated by a SCP now instead of a state machine. The SCP also handles Amiga Copper style audio/video co-processing.


    I quickly modified the Qupls TLB for NRU updates and then compared
    NRU, LRU, and RANDOM:
    // TLB is four-way, 512 entries.
    // 3000 LUTs / 500 FFs / 8 BRAMs / 205 MHz RANDOM
    // 2900 LUTs / 500 FFs / 8 BRAMs / 205 MHz LRU
    // 2900 LUTs / 500 FFs / 8 BRAMs / 205 MHz NRU

    I did not get the results I was expecting. In my FPGA
    implementation they are all close to the same. The implementation
    has just a single-ported TLB, so there is some muxing for updates.
    It could be that all the other logic drowns out the difference.


    It is tradeoffs here...



    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Feb 11 18:15:25 2026
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2026-02-10 4:56 p.m., MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 2/10/2026 1:04 PM, MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-02-09 5:27 p.m., BGB wrote:
    On 2/9/2026 2:44 PM, EricP wrote:
    ------------------------------------

    I found a slide presentation on the idea.

    I am wondering how a LRU system would work with the skewed entries.
    I assume the entries would be shifting between the ways for LRU. So, if there is a different size page in one of the ways would it still work?
    One would use the not-recently-used variant of LRU.
    One accumulated used bits on a per set basis.
    When all used-bits have been set, one clears
    all used bits.

    In practice it is easier to implement and almost as good.

    FWIW:
    I ended up not using an MRU or LRU for the TLB.

    My NRU is comprised of a single S-R flip-flop per PTE, with all of them wired to an (high-fan-in) AND gate.

    When all NRU-bits are set, AND gate asserts, and all reset synchronously.

    When the TLB takes a miss, find-first logic chooses a unary index for the new PTE. Writing (or reading) a PTE sets its S-R FF.

    Simple and effective.

    While there is a loss of efficiency here (it always evicts the oldest
    page rather than the least-used), it does mean that the TLB miss handler can model the TLB deterministically (if the handler knows the length and width; it can be modeled exactly).

    Potentially a clever TLB miss handler could compensate by re-injecting
    the page that fell off the end if it is very likely to trigger another
    TLB miss (possibly tracking commonly-missed pages in an STF MRU or
    similar in software, and re-injecting pages that appear within this MRU
    at a higher rank than whatever is in the current last place).


    Another theoretical possibility would be to allow the CPU to access the
    contents of the TLB via MMIO, but there are strong reasons to not do
    this.

    MMI/O is potentially SLOW.

    So, naive FIFO ended up being preferable. The other possibility
    being to have LRU, but then needing to fully evict a given set whenever
    a pages' attributes are modified (such as a page swapped in or out of
    the pagefile).

    True LRU is BigO( n^3 ) in logic gates; whereas NRU is BigO( k ).

    I did not realize it was so many gates for LRU? I thought it was mainly
    latches. The LRU in use for Qupls has the ways connected as a shift
    register. New entries are always added at way three for a four-way TLB;
    way zero gets evicted.

    True LRU for 4-way set is 26-gates plus 4 D-type flip-flops.
    There are just too many states and too many state transitions.

    MRU is 1 logic gate plus 1 SR-flip-flop per entry and a FF0 across
    the whole set.

    You could never make true LRU for a 64-entry TLB, but MRU is too small to see.

    We (logic designers) dislike shift registers, but I think your
    use is clever.

    If it is lower gate count for NRU the TLB may be switched - another re-write.

    The Qupls TLB is accessible as MMIO but this is probably unnecessary. It
    is also updated by a SCP now instead of a state machine. The SCP also handles Amiga Copper style audio/video co-processing.




    It is tradeoffs here...


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Feb 11 18:06:11 2026
    From Newsgroup: comp.arch

    On 2/10/2026 3:56 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 2/10/2026 1:04 PM, MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-02-09 5:27 p.m., BGB wrote:
    On 2/9/2026 2:44 PM, EricP wrote:
    ------------------------------------

    I found a slide presentation on the idea.

    I am wondering how a LRU system would work with the skewed entries.
    I assume the entries would be shifting between the ways for LRU. So, if
    there is a different size page in one of the ways would it still work?

    One would use the not-recently-used variant of LRU.
    One accumulated used bits on a per set basis.
    When all used-bits have been set, one clears
    all used bits.

    In practice it is easier to implement and almost as good.

    FWIW:
    I ended up not using an MRU or LRU for the TLB.

    My NRU is comprised of a single S-R flip-flop per PTE, with all of them
    wired to an (high-fan-in) AND gate.

    When all NRU-bits are set, AND gate asserts, and all reset synchronously.

    When TLB takes a miss, Find-First logic chooses a unary index for the new PTE. Writing (or reading) a PTE sets its S-R FF.

    Simple and effective.



    All pose a similar issue though in that they introduce behavior that is
    not visible to the miss handler, and thus the exact state of the TLB
    can't be modeled in software. Well, unless maybe there were some good
    way to estimate this in SW and with a LDTLB which encoded which way to
    load the TLBE into.

    LDTLB: Load register pair into TLB, push-back by 1.
    LDTLB{1/2/3/4}: Load into 1st/2nd/3rd/4th way of TLB.


    Then again, in this case, NRU state would be 1024 bits, and could
    potentially be made accessible via a special mechanism (say, an LDNRU instruction). Would still be too bulky in this case to route it through
    CPUID, and wouldn't want to get the MMIO mechanism involved.


    Well, maybe could special case-it:
    MMIO requests to the TLB need not make it all the way to the MMIO bus,
    the TLB could just see the request passing by and be like "hey, that
    one's for me" and handle it directly as it passes by (a similar
    mechanism having already been used for LDTLB, which sort of reaches the
    TLB like a specially malformed memory store).


    If going this route, could maybe make sense to allow an access per slot,
    where one can fetch accessed (A) and Dirty (D) bits, possibly along with
    any other state that is being tracked (probably all 0 for now).


    While there is a loss of efficiency here (it always evicts the oldest
    page rather than the least-used), it does mean that the TLB miss handler
    can model the TLB deterministically (if the handler knows the length and
    width; it can be modeled exactly).

    Potentially a clever TLB miss handler could compensate by re-injecting
    the page that fell off the end if it is very likely to trigger another
    TLB miss (possibly tracking commonly-missed pages in an STF MRU or
    similar in software, and re-injecting pages that appear within this MRU
    at a higher rank than whatever is in the current last place).


    Another theoretical possibility would be to allow the CPU to access the
    contents of the TLB via MMIO, but there are strong reasons to not do
    this.

    MMI/O is potentially SLOW.


    Beyond this... If we are already paying the cost of an exception handler
    for TLB Miss, then some extra MMIO isn't likely a huge issue.

    Bigger issue is that it requires a special mechanism to do so, which
    could risk an undesirable level of resource cost impact to the TLB. This
    could be reduced if the access pattern for the TLB matches that of its
    normal operation though.


    If I were to design such an MMIO interface, maybe:
    ( 2:0): Always 0, 64-bit MMIO only
    ( 4:3): Word Selector (Provision for 256-bit TLBEs; Currently 128-bit)
    ( 7:5): TLB Way, make provision for 8-way TLB
    (17:8): TLB Slot, making provision for 1024x TLB
    (27:18): Constant Pattern, TBD
    (47:28): FFFFF, Lower-range MMIO Space

    Some quick/dirty mental mathing implies a base address of probably:
    FFFF_FE000000..FFFF_FE03FFFF

    Might need more address range if I want additional metadata, or
    provision for a bigger TLB.

    Or similar...
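    As a sanity check on the field layout and the mental math, the proposed
    address encoding can be sketched as below. The 27:18 constant pattern is
    TBD in the proposal; 0x380 is my guess, chosen only because it makes the
    base come out to FFFF_FE000000. Nothing here is an implemented interface:

```c
#include <stdint.h>
#include <assert.h>

/* Guessed value for the TBD constant pattern in bits 27:18. */
#define TLB_PAT 0x380ull

/* Build an MMIO address for one 64-bit word of one TLBE. */
static uint64_t tlb_mmio_addr(unsigned word, unsigned way, unsigned slot)
{
    return ((uint64_t)0xFFFFFull << 28) |       /* 47:28: lower MMIO space */
           (TLB_PAT << 18)              |       /* 27:18: constant pattern */
           ((uint64_t)(slot & 0x3FF) << 8) |    /* 17:8 : slot (1024x)     */
           ((uint64_t)(way & 7) << 5)   |       /*  7:5 : way (8-way max)  */
           ((uint64_t)(word & 3) << 3);         /*  4:3 : word; 2:0 zero   */
}
```

    The highest encodable address then lands just inside the
    FFFF_FE000000..FFFF_FE03FFFF range quoted above.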


    Can note that SH-4 had an MMIO interface for the TLB, but initially I
    didn't want to go this way.


    So, naive FIFO ended up being preferable. The other possibility
    being to have LRU, but then needing to fully evict a given set whenever
    a pages' attributes are modified (such as a page swapped in or out of
    the pagefile).

    True LRU is BigO( n^3 ) in logic gates; whereas NRU is BigO( k ).


    oK.

    I guess I can note that my use of LRU/MRU was most often based on STF,
    in a naive case:
    Access 3: Swap 3 and 2
    Access 2: Swap 2 and 1
    Access 1: Swap 1 and 0
    Access 0: No Change

    With insertion adding to the front and pushing everything back.

    Some more advanced ranking would involve access counts and swapping
    based on relative access frequency (to give a "true ranking order"), but
    I haven't usually done this (usually not worth the added cost vs the approximate answer that the STF approach gives).
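    The naive swap scheme above, as a behavioral sketch (4 ways, with
    `rank[0]` as the front of the ranking; names are mine):

```c
#include <assert.h>

#define WAYS 4

/* On an access, the hit entry swaps one step toward the front
 * (Swap-Towards-Front); an access at position 0 is a no-op. */
static void stf_access(int rank[WAYS], int hit_pos) {
    if (hit_pos > 0) {
        int t = rank[hit_pos - 1];
        rank[hit_pos - 1] = rank[hit_pos];
        rank[hit_pos] = t;
    }
}

/* Insertion: new entry goes to the front, everything is pushed back,
 * and the last-place entry is evicted (returned). */
static int stf_insert(int rank[WAYS], int new_way) {
    int evicted = rank[WAYS - 1];
    for (int i = WAYS - 1; i > 0; i--)
        rank[i] = rank[i - 1];
    rank[0] = new_way;
    return evicted;
}
```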



    IIRC I had used a more convoluted pattern to deal with Addr96 though, so (uploading an Addr48 TLBE):
    1:0 -> 3:2
    2 -> 1 (if it was a 48-bit TLBE)
    New -> 0
    1 -> Discarded

    With the Addr96 case being:
    1:0 -> 3:2
    New -> 1:0

    Rather than a more naive pattern for Addr48:
    3 -> Discarded
    2 -> 3
    1 -> 2
    0 -> 1
    New -> 0

    The more convoluted pattern mostly to deal with 96-bit addressing as
    paired TLBEs.




    Well, sort of related to a recent experiment with trying to add
    networking support to a 3D engine of mine (my BT3 engine), and ended up instead trying to use behavioral determinism to sync mobs rather than
    explicit messages.

    The code was originally written to use random numbers, but then noted I
    could use a 5D hash (X/Y/Z, Time, Aux) instead of a more traditional
    RNG. Basically works, and is easier to make deterministic in a world where
    what things are loaded in is variable (the Doom approach of using a globally-synchronized RNG would be N/A in this case).


    Though, for something like a mob, maybe difficult to imagine. One could imagine the mob as a sort of particle floating along with Brownian
    motion in a space where each spot along its path already has a vector
    waiting to push it along as soon as it arrives. If you can already
    determine the vectors, then starting from a known initial state it
    should be possible to step forwards until the current time and end up
    with everything in the same positions without needing any
    synchronization messages to be sent between the clients.


    Seemed kinda clever, but apparently isn't all that atypical.
    Grok claimed that this same basic approach was fairly common in RTS
    games though (eg, like StarCraft and WarCraft and similar), with some
    other ideas I was floating here being more typical of MMOs.


    Though, terrain generation in my 3D engines had already tended to
    use spatial hashing.

    Say, for example, in my current 3D engine for things like height-map:
    Fill a lookup table with random numbers generated from a PRNG (256 entries);
    Hash the X/Y or X/Y/Z coords ("(((X*P1+Y*P2+Z*P3)*P4)>>SH)&255");
    Select an index from the lookup table, and repeat with adjusted indices
    for adjacent entries;
    Interpolate between these.
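    A minimal sketch of the steps above (the primes and shift amount are
    placeholders for whatever P1..P4/SH actually are, and the table fill is
    a stand-in for the real PRNG):

```c
#include <stdint.h>
#include <assert.h>

static uint8_t lut[256];            /* filled once from a PRNG at init */

static void lut_init(uint32_t seed) {
    for (int i = 0; i < 256; i++) { /* tiny LCG standing in for the PRNG */
        seed = seed * 1664525u + 1013904223u;
        lut[i] = (uint8_t)(seed >> 24);
    }
}

/* Spatial hash, shaped like "(((X*P1+Y*P2+Z*P3)*P4)>>SH)&255". */
static int cell_hash(int x, int y, int z) {
    uint32_t h = ((uint32_t)x * 73856093u +
                  (uint32_t)y * 19349663u +
                  (uint32_t)z * 83492791u) * 2654435761u;
    return (h >> 16) & 255;
}

/* Height at (cx + fx/256, cy + fy/256): hash the four surrounding
 * lattice points, look each up, then bilinearly interpolate. */
static int height_at(int cx, int cy, int fx, int fy) {
    int h00 = lut[cell_hash(cx,     cy,     0)];
    int h10 = lut[cell_hash(cx + 1, cy,     0)];
    int h01 = lut[cell_hash(cx,     cy + 1, 0)];
    int h11 = lut[cell_hash(cx + 1, cy + 1, 0)];
    int top = h00 + (((h10 - h00) * fx) >> 8);
    int bot = h01 + (((h11 - h01) * fx) >> 8);
    return top + (((bot - top) * fy) >> 8);
}
```

    The important property is that the same (seed, coordinate) pair always
    yields the same height, which is what makes the whole scheme usable for
    deterministic world generation.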

    In this case, plain spatial hashing RNG had already been used for things
    like placing trees and similar (where the whole process needs to be deterministic across the space).

    Adding a time aspect and the potential of player interactions does
    complicate things, as any direct interaction between a player and a mob effectively breaks determinism (and a replay would require actively
    recording and replaying player movements and actions as part of the simulation, but possibly the server could keep track of this, where
    every player is for every tick, and trigger a partial re-sim in the case
    of known determinism breakages). Either this, or the server itself plays
    the sim forward (from the last reference-state) with the players added
    in, and then makes a new snapshot post interaction which is then sent to
    any relevant clients (as the new reference state).

    Well, along with imposing some behavioral tweaks, like if no one is sufficiently nearby then the region borders will behave as invisible walls.

    Say, for example, distance from nearest player:
    0-47 meters: Full behavior;
    48-95: Limiting advanced behaviors, region borders become walls;
    96+: Mobs stop moving and AI behavior is disabled;
    ~ 128-192: Ticking stops entirely,
    Limbo state until it exceeds render distance and fully unloads.


    Can also use a similar sort of spacetime-hash RNG for things like
    block-ticks (such as dirt blocks turning into grass), which also
    shouldn't need explicit synchronization.

    Oddly, in some of this stuff, there do seem to be curious physics
    parallels though.

    ...



    It is tradeoffs here...


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Feb 12 02:16:13 2026
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 2/10/2026 3:56 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 2/10/2026 1:04 PM, MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-02-09 5:27 p.m., BGB wrote:
    On 2/9/2026 2:44 PM, EricP wrote:
    ------------------------------------

    I found a slide presentation on the idea.

    I am wondering how a LRU system would work with the skewed entries.
    I assume the entries would be shifting between the ways for LRU. So, if
    there is a different size page in one of the ways would it still work?
    One would use the not-recently-used variant of LRU.
    One accumulated used bits on a per set basis.
    When all used-bits have been set, one clears
    all used bits.

    In practice it is easier to implement and almost as good.

    FWIW:
    I ended up not using an MRU or LRU for the TLB.

    My NRU is comprised of a single S-R flip-flop per PTE, with all of them wired to an (high-fan-in) AND gate.

    When all NRU-bits are set, AND gate asserts, and all reset synchronously.

    When TLB takes a miss, Find-First logic chooses a unary index for the new PTE. Writing (or reading) a PTE sets its S-R FF.

    Simple and effective.



    All pose a similar issue though in that they introduce behavior that is
    not visible to the miss handler, and thus the exact state of the TLB
    can't be modeled in software. Well, unless maybe there were some good
    way to estimate this in SW and with a LDTLB which encoded which way to
    load the TLBE into.

    LDTLB: Load register pair into TLB, push-back by 1.
    LDTLB{1/2/3/4}; Load into 1st/2nd/3rd/4th way of TLB.

    MIPS (company not Stanford) had the TLB logic find the entry to be
    written (find first), so SW just hashed a big table, checked match
    and if so, inserted new PTE where HW said.

    IF you are going to do a SW TLB reloader, this is the way.

    Then again, in this case, NRU state would be 1024 bits, and could potentially be made accessible via a special mechanism (say, an LDNRU instruction). Would still be too bulky in this case to route it through CPUID, and wouldn't want to get the MMIO mechanism involved.


    Well, maybe could special case-it:
    MMIO requests to the TLB need not make it all the way to the MMIO bus,
    the TLB could just see the request passing by and be like "hey, that
    one's for me" and handle it directly as it passes by (a similar
    mechanism having already used for LDTLB, which sort of reach the TLB
    like specially malformed memory stores).

    In wide super-scalar machines all of the memory 'paths' need this logic,
    and sometimes ordering of MMI/O becomes a "little tricky". ---------------------
    MMI/O is potentially SLOW.


    Beyond this... If we are already paying the cost of a exception handler
    for TLB Miss, then some extra MMIO isn't likely a huge issue.

    MIPS TLB reloader was 17 cycles (when instructions were in cache)
    I have witnessed MMI/O that was slower than 1 µs !! Also the control
    register bus in Athlon/Opteron was SLOW, too:: so you can't always
    use control registers in lieu of MMI/O ...
    -----------------

    True LRU is BigO( n^3 ) in logic gates; whereas NRU is BigO( k ).


    oK.

    I guess I can note that my use of LRU/MRU was most often based on STF,
    in a naive case:
    Access 3: Swap 3 and 2
    Access 2: Swap 2 and 1
    Access 1: Swap 1 and 0
    Access 0: No Change

    SWAP moves the whole size of PTE !! whereas NRU sets 1 bit.
    Would never be allowed in a low-power implementation.
    ----------------
    Say, for example, in my current 3D engine for things like height-map:
    Fill a lookup table with random numbers generated from a PRNG (256 entries); Hash the X/Y or X/Y/Z coords ("(((X*P1+Y*P2+Z*P3)*P4)>>SH)&255");
    Select an index from the lookup table, and repeat with adjusted indices
    for adjacent entries;
    Interpolate between these.

    One of the cool things about HW hashing is bit reversal of the fields in
    the hash--it goes a long way to 'whiten' the noise signature.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Feb 12 16:17:50 2026
    From Newsgroup: comp.arch

    On 2/11/2026 8:16 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 2/10/2026 3:56 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 2/10/2026 1:04 PM, MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-02-09 5:27 p.m., BGB wrote:
    On 2/9/2026 2:44 PM, EricP wrote:
    ------------------------------------

    I found a slide presentation on the idea.

    I am wondering how a LRU system would work with the skewed entries.
    I assume the entries would be shifting between the ways for LRU. So, if
    there is a different size page in one of the ways would it still work?
    One would use the not-recently-used variant of LRU.
    One accumulated used bits on a per set basis.
    When all used-bits have been set, one clears
    all used bits.

    In practice it is easier to implement and almost as good.

    FWIW:
    I ended up not using an MRU or LRU for the TLB.

    My NRU is comprised of a single S-R flip-flop per PTE, with all of them
    wired to an (high-fan-in) AND gate.

    When all NRU-bits are set, AND gate asserts, and all reset synchronously.
    When TLB takes a miss, Find-First logic chooses a unary index for the new
    PTE. Writing (or reading) a PTE sets its S-R FF.

    Simple and effective.



    All pose a similar issue though in that they introduce behavior that is
    not visible to the miss handler, and thus the exact state of the TLB
    can't be modeled in software. Well, unless maybe there were some good
    way to estimate this in SW and with a LDTLB which encoded which way to
    load the TLBE into.

    LDTLB: Load register pair into TLB, push-back by 1.
    LDTLB{1/2/3/4}; Load into 1st/2nd/3rd/4th way of TLB.

    MIPS (company not Stanford) had the TLB logic find the entry to be
    written (find first), so SW just hashed a big table, checked match
    and if so, inserted new PTE where HW said.

    IF you are going to do a SW TLB reloader, this is the way.


    The LDTLB mechanism isn't that much different from the one in SH-4, but differs in a few ways:
    SH-4 used MMIO registers for the TLBE to load-in;
    BJX2 uses R1:R0 as fixed registers.

    Also different TLB shape:
    SH-4:
    64-entry fully associative;
    LDTLB pushes whole TLB back by 1 entry,
    adding new TLBE to the front of the list;
    With oldest entry then falling off the end.
    BJX2:
    256 x 4-way;
    Had experimented with swapping,
    but ended up not using it for sake of determinism.

    Both use an 0R instruction that serves the role of taking whatever TLBE
    is presented to it and shoving it into the TLB in the place it goes.


    One other practical difference was that it was possible to read TLB
    contents via MMIO on SH-4, but this had not been carried over into my projects. Initially, I was just sort of like "WTF?" and didn't see it as particularly useful. Though now I can at least see some possible
    use-cases, though maybe debatable if enough of a use case to actually
    justify adding something like this.




    Then again, in this case, NRU state would be 1024 bits, and could
    potentially be made accessible via a special mechanism (say, an LDNRU
    instruction). Would still be too bulky in this case to route it through
    CPUID, and wouldn't want to get the MMIO mechanism involved.


    Well, maybe could special case-it:
    MMIO requests to the TLB need not make it all the way to the MMIO bus,
    the TLB could just see the request passing by and be like "hey, that
    one's for me" and handle it directly as it passes by (a similar
    mechanism having already used for LDTLB, which sort of reach the TLB
    like specially malformed memory stores).

    In wide super-scalar machines all of the memory 'paths' need this logic,
    and sometimes ordering of MMI/O becomes a "little tricky". ---------------------

    Yeah.
    I haven't yet moved past 1 memory port, for the whole reason that the
    jump from 1 to 2 memory ports is steep.
    It is likely though that higher throughput would be possible in theory
    if 2 memory ports could be supported.


    MMI/O is potentially SLOW.


    Beyond this... If we are already paying the cost of an exception handler
    for TLB Miss, then some extra MMIO isn't likely a huge issue.

    MIPS TLB reloader was 17 cycles (when instructions were in cache)
    I have witnessed MMI/O that was slower than 1 µs !! Also the control
    register bus in Athlon/Opteron was SLOW, too:: so you can't always
    use control registers in lieu of MMI/O ...
    -----------------

    In my case, the TLB reload is far more than this (hundreds of cycles),
    but mostly infrequent enough to not be a huge issue.

    Though, part of this is because a 256x4 TLB tends to fit patterns a lot
    better than a 64-entry full-assoc TLB would have, as well as being *far*
    cheaper (uses some BRAMs rather than an absurd number of LUTs).



    True LRU is BigO( n^3 ) in logic gates; whereas NRU is BigO( k ).


    oK.

    I guess I can note that my use of LRU/MRU was most often based on STF,
    in a naive case:
    Access 3: Swap 3 and 2
    Access 2: Swap 2 and 1
    Access 1: Swap 1 and 0
    Access 0: No Change

    SWAP moves the whole size of PTE !! whereas NRU sets 1 bit.
    Would never be allowed in a low-power implementation.
    ----------------

    Possibly.

    As noted, I ended up not using it, in favor of always pushing entries front-to-back.


    Say, for example, in my current 3D engine for things like height-map:
    Fill a lookup table with random numbers generated from a PRNG (256 entries);
    Hash the X/Y or X/Y/Z coords ("(((X*P1+Y*P2+Z*P3)*P4)>>SH)&255");
    Select an index from the lookup table, and repeat with adjusted indices
    for adjacent entries;
    Interpolate between these.

    One of the cool things about HW hashing is bit reversal of the fields in the hash--it goes a long way to 'whiten' the noise signature.


    Yeah.

    In software, one is mostly just stuck with "multiply by primes" and XOR
    and similar.

    There is some "magic" that can be gained with Morton shuffles in
    hashing, but this needs special CPU instructions to have much advantage
    over multiply-by-primes or similar (Morton shuffle via lookup tables
    tending to have higher latency than by using a multiply).

    Shift+XOR can do some special things, but naively shifting and XOR'ing multiple values together tends to be worse than
    multiply-by-primes-and-add (so, worse from a hashing POV).

    Bit reversal would be nice, but alas...
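    Without a dedicated instruction, a 32-bit reversal in software at least
    comes down to a five-step shift/mask ladder (standard bit-twiddling,
    nothing BJX2-specific), which is still a good deal more work than one
    multiply:

```c
#include <stdint.h>
#include <assert.h>

/* Reverse the bits of a 32-bit word: swap adjacent bits, then pairs,
 * then nibbles, then bytes, then halfwords. */
static uint32_t bitrev32(uint32_t x) {
    x = ((x & 0x55555555u) <<  1) | ((x >>  1) & 0x55555555u);
    x = ((x & 0x33333333u) <<  2) | ((x >>  2) & 0x33333333u);
    x = ((x & 0x0F0F0F0Fu) <<  4) | ((x >>  4) & 0x0F0F0F0Fu);
    x = ((x & 0x00FF00FFu) <<  8) | ((x >>  8) & 0x00FF00FFu);
    return (x << 16) | (x >> 16);
}
```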


    But, yeah, with hashed indexing into a fixed-size lookup table, one can
    get a surprising variety of terrain height-maps. Well, maybe "sameness" existing on the macro scale, if someone looks at a lot of world seeds.


    There are some possible improvements to the algo, but don't really want
    to poke at it as changing the algo would break the existing terrain generation. Well, and I have dealt with Minecraft enough to know what
    this mess looks like (though, luckily has become much less of an issue
    in recent years, but in the past, it was common for new versions of the
    game to break terrain generation in ways which left big ugly seams in
    the world where the terrain in newly generated chunks doesn't match up
    with the older chunks, ...).




    One difference exists here between BT2 and BT3 engines:
    BT2 used the hash value directly, and interpolated between the hashed
    values;
    BT3 routes it through a lookup table that is filled from the starting
    seed (holding values nominally in the range of -1.0 to 1.0).

    Both differ from Perlin Noise which used a more complex process, IIRC:
    Normalized unit vectors;
    Interpolating between unit vectors;
    Taking the dot-product with another vector to get a scalar.

    Made sense to drop the vector math parts, as one can get basically the
    same effect by just interpolating linear values.


    The original BGBTech (BT1) engine had used something more like traditional Perlin noise.

    Also, like BT2/BT3 also technically had a wrapping world, but in this
    case the terrain generation didn't wrap cleanly so at the edge there
    would be an abrupt seam.



    So, say:

    BT1: World had signed coordinates.
    Theoretically wrapped, but world-seam issues were never addressed.
    Engine performance and memory use was too bad to allow for much travel
    (so initial terrain generation had big walls at 1km from the origin).
    Theoretically, if the world seam were reached, there would have been an
    abrupt mismatch in the terrain generation, with the player teleported to
    the opposite side once crossed.

    BT2:
    Also wrapped, using 20.12 fixed-point coords for many things;
    In this case, had (sorta) interpreted the coordinates as signed, but
    left signed/unsigned handling as ambiguous;
    Tested seam, clean wrap on seam actually worked.

    BT3:
    Wraps, but 16.8 (X/Y) and 9.7 (Z) in this case, mostly because I had
    used a packed 64-bit format for world coords (rather than a struct). Z
    was originally 8.8, but I expanded the max world height.
    Couldn't go to 10.6 though as this broke the ray-casting.

    So, alas, both more range and precision would be nice, but would need to rework the engine to go to doing all of this stuff with structs rather
    than bit-packed values. But, if I went to a struct here, could maybe add
    a world-scale option to go back to a 1024km cube for the world size.

    Original reason for the packed 64-bit coords though was to make it
    cheaper on the BJX2 core, but ironically in this case could also allow
    for packed "__int128" here (supported in BGBCC, but not in MSVC).

    All this does mean that the maximum world size in BT3 is ~ 64 km (vs
    1024 km in BT2).
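    A guess at what the packed 64-bit coordinate format might look like: the
    16.8 X/Y and 9.7 Z figures imply 24 + 24 + 16 = 64 bits, but the actual
    field order in BT3 isn't given, so the layout below is an assumption:

```c
#include <stdint.h>
#include <assert.h>

/* 16.8 fixed-point X and Y (24 bits each), 9.7 fixed-point Z (16 bits).
 * Field order (Z high, then Y, then X low) is my assumption. */
static uint64_t pack_xyz(uint32_t x, uint32_t y, uint32_t z) {
    return ((uint64_t)(z & 0xFFFFu)   << 48) |
           ((uint64_t)(y & 0xFFFFFFu) << 24) |
            (uint64_t)(x & 0xFFFFFFu);
}

static uint32_t unpack_x(uint64_t c) { return (uint32_t)(c & 0xFFFFFF); }
static uint32_t unpack_y(uint64_t c) { return (uint32_t)((c >> 24) & 0xFFFFFF); }
static uint32_t unpack_z(uint64_t c) { return (uint32_t)(c >> 48); }
```

    With 8 fraction bits, 1 unit = 1/256 of a meter, and the 16 integer bits
    of X/Y give the ~64 km wrap mentioned above.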


    But, alas, reason for BT3's existence was mostly because BT2 would have
    been too heavyweight to run on a 50 MHz BJX2 core. But, trying to scale
    it up (to give a better experience on a desktop PC use case) kinda
    conflicts with a lot of its original design goals.



    Also in BT3, the world coords are currently seen as nominally unsigned,
    with the spawn location right near the seam.

    For expanded world height, the BT3 engine is like the BT2 engine in that
    it works by stacking regions in the Z axis rather than by making regions taller. The BT2 engine could have supported fully cube-shaped worlds
    though, but the BT3 engine can't currently do this.


    All 3 engines differed from Minecraft though in that the world is seen
    as wrapping rather than an infinite plane.


    Also maybe ironically as well, in the BT2 engine, the original world
    type was not an open Earth-like surface world, but rather an endless 3D
    cave system. The handling of an "overworld" was itself a special case
    built "inside" the 3D cave system via the terrain generator.

    So, in this case, you weren't "outside" as a property of the world, but
    rather because of there being a "sky" block within a certain distance
    over the player's head. So, the terrain generator signaled "overworld"
    by lining the top of the generated regions with sky blocks (if one were
    to mine through the sky layer, they could potentially end up back in the caves).

    One other gimmick that the BT2 engine had that the BT3 engine lacks, was
    the ability to effectively rotate the world relative to the player (such
    as inverting gravity). But, it could have also allowed for spherical,
    hollow sphere, or cube/hollow cube worlds, ... In this case, the gravity direction was a property of the player (with the player's yaw/pitch/roll relative to the "local gravity" coordinate space).

    Could in theory bring this back, but would need to go back to
    struct-based coordinates to make use of this.


    Some other differences:
    BT2 used a 256x256x256 region size (16x16x16 chunks)
    BT3 uses a 128x128x128 region size (8x8x8 chunks)
    Reduction in region size was mostly a memory-saving thing.
    Does mean more loaded regions for a given draw-distance though.
    BT2 had used an STF+AdRice+LZ77 based compressor for chunks.
    BT3 used RP2 (byte oriented LZ, no native entropy coder)
    But recently has gained an optional entropy post stage,
    Used if it will get a sufficient size reduction to justify it.
    Current entropy post-encoders being STF+AdRice and Range Coding.
    STF+AdRice: Faster, but not as strong.
    Range Coding: Better compression, but very slow.
    Raw Bytes: Fastest option.
    Uses raw bytes in most cases:
    When a chunk LZ compresses to 200-300 bytes,
    entropy coding doesn't do much.
    It is mostly larger chunks (700-1K or more)
    where it "actually does something".

    Contrast:
    The original BGBTech / BT1 engine had used RLE, basically the same RLE
    scheme as Wolf3D and ROTT (3-byte RLEs: Escape, Length, Value; Escape
    chosen to be the least-used byte). However, RLE was not particularly
    effective here.



    Initially, the STF+AdRice post compressor was entirely ineffective, but
    I figured out that there was a tweak that could be made to the AdRice to
    gain around 10% or so for entropy (putting it at around 6% worse than
    the range coder), and thus "actually effective".

    Note as for the lack of Huffman:
    Static (table driven) Huffman is ineffective with small data.
    Space needed to encode a Huffman table eats any gains,
    even with Deflate's Huffman-encoded Huffman-tables trick.
    Also, setting up the tables is expensive (making it slow).
    Global static tables suck
    And, if not a perfect fit, are worse than AdRice.
    Vitter's Algorithm (Adaptive Huffman) is stupidly slow,
    manages to somehow make a Range Coder seem performant.



    Note that STF is part of the "secret sauce" here in that it allows
    AdRice to be made to act more like a Huffman coder.

    Where, in this case:
    STF: Swap Towards Front
    Typically to ((I*15)/16), best results IME (vs 7/8 or 31/32).
    MTF: Move to Front
    Symbol moves to front, shifting all other symbols back.
    Both slower and less effective than STF for this.
    SMTF: Swapping Move To Front.
    Typically behaves like STF for higher-probability symbols.
    Top 32/64/128 typically behaving like in STF.
    Rotates space and swaps symbol into new index 0 if low probability.
    More complex than plain STF, effectiveness depends on data type.

    One way to improve accuracy of STF is to add a counter, and only doing
    the swap if the counter for a symbol exceeds the count of the
    destination symbol. However, the usage of counters and conditional
    swapping has a fairly steep performance cost (though, still not quite as
    bad as using Range Coding).
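    The plain STF update (without the counters) can be sketched as an
    encoder-side transform. The (I*15)/16 swap target follows the text; the
    identity table init and linear search are mine, for illustration only:

```c
#include <assert.h>

static unsigned char stf_tab[256];

static void stf_init(void) {
    for (int i = 0; i < 256; i++)
        stf_tab[i] = (unsigned char)i;
}

/* Encode one byte: emit the symbol's current table index, then swap
 * the symbol toward the front, to index (i*15)/16. Frequent symbols
 * thus migrate to small indices, which AdRice codes cheaply. */
static int stf_encode(unsigned char sym) {
    int i = 0;
    while (stf_tab[i] != sym) i++;
    int j = (i * 15) / 16;
    unsigned char t = stf_tab[j];
    stf_tab[j] = stf_tab[i];
    stf_tab[i] = t;
    return i;
}
```

    The decoder runs the same table update in lockstep, so no side
    information is needed; only the emitted indices are entropy coded.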

    ...



    For textures:

    BT1: Had used BTJPG, or a tweaked version of the T.81 JPEG format,
    mostly tweaked to allow for optional lossless coding and alpha channel.
    The lossless mode worked by switching over to RCT and SERMS-RDCT vs
    YCbCr and DCT, but RDCT was slow (and only used if all quantization
    parameters were 1). Alpha channel was handled by embedding the alpha
    channel as a monochrome image within an APP11 tag.

    IIRC, the merit of the hacky JPEG variant was that it was typically
    still more space efficient and faster to decode than PNG (despite its conceptual simplicity, high performance PNG decoding is difficult, particularly if the goal is to transcode to DXT5 or BC7 or similar).


    BT2: Had used my BT4B format, which was natively color-cell based,
    moderately fast to decode, but inherently lossy and prone to artifacts.


    BT3: Initially used DDS. Had considered using a variant of my BT5B
    format, but by its design BT5B couldn't give better inherent Q/bpp than
    DDS. BT5B essentially followed in similar footsteps to RPZA and MS-CRAM.


    Ended up switching BT3 over to my UPIC format, which ironically was
    designed mostly as a cheaper alternative to T.81 JPEG, but following a
    similar general structure:
    8x8 transform blocks organized into 16x16 macroblocks with 4:2:0 or
    4:4:4 subsampling, though borrowing a few design tricks from the Theora
    codec as well.

    Like with BTJPG, was able to implement a fast transcoder to DXT1 (could
    maybe also use BC7H; but for now the BT3 engine is just using DXT1).


    BT2 had used exclusively sprite graphics.

    BT1 had used 3D models, mostly using the AC3D and Valve SMD formats.

    BT3 initially used sprites, but now also using 3D models, mostly in a
    custom BMD format, generally being generated from SCAD scripts ATM.
    Currently, SCAD lacks any good way to do textures or skeletal animation,
    but I lack any good/obvious way to add these onto the SCAD language in a
    way that wouldn't break compatibility with OpenSCAD.

    Could do something like "/*[[bone("LArm")]]*/" but this sucks...

    Could just do my own SCAD viewer, but, ... Well, forking the SCAD
    language dilutes the merit of using SCAD, though there would still be a
    common subset for untextured non-animated models.

    Idea for animation in this case is mostly to assign CSG solids (or
    other groups/unions) to bones on a skeleton, and then by moving the
    bone it moves the relevant part of the model.

    In this case, the animation loops would also be procedural, could
    possibly also move it over to representing these as an extension of the
    SCAD language.

    But, if going this route, could maybe also use the C preprocessor to
    allow including other SCAD files for reusable components (well, and
    the "already implicitly exists" extension of also supporting integer
    math, where OpenSCAD effectively lacks support for integer math
    operations; and C-style "for()" loops, as SCAD has its own very wonky
    for loops).



    For audio:
    Ironically, thus far, all 3 engines have ended up back at ADPCM.
    Thus far, have not found an audio format that shows clear advantage
    over using some variation of ADPCM for sound effects (typically IMA
    ADPCM; technically there is also "Microsoft ADPCM", but they kinda
    screwed it up).

    The 2nd place option is mostly A-Law waves (typically slightly better
    quality, but larger size). Where, 16-bit PCM is needlessly bulky, and
    8-bit PCM sounds kinda like crap (despite 11 kHz 8-bit PCM being the
    "gold standard" for 90s era games).

    Granted, 8 kHz IMA ADPCM isn't exactly peak audio quality either, but it
    is also like 3/8 the size of said 11kHz 8-bit PCM. Can jump to 16kHz
    when needed though, and still be smaller; or use 2-bit ADPCM when
    smaller sizes are needed.
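    For reference, the per-nibble decode of IMA ADPCM is small and well
    standardized; a minimal mono decoder sketch (the standard IMA
    step/index tables, not engine code from any of the BT engines):

```c
#include <stdint.h>

/* Standard IMA ADPCM step-size and index-adjust tables. Each 4-bit
 * nibble expands to one 16-bit sample. */
static const int16_t ima_step[89] = {
        7,     8,     9,    10,    11,    12,    13,    14,
       16,    17,    19,    21,    23,    25,    28,    31,
       34,    37,    41,    45,    50,    55,    60,    66,
       73,    80,    88,    97,   107,   118,   130,   143,
      157,   173,   190,   209,   230,   253,   279,   307,
      337,   371,   408,   449,   494,   544,   598,   658,
      724,   796,   876,   963,  1060,  1166,  1282,  1411,
     1552,  1707,  1878,  2066,  2272,  2499,  2749,  3024,
     3327,  3660,  4026,  4428,  4871,  5358,  5894,  6484,
     7132,  7845,  8630,  9493, 10442, 11487, 12635, 13899,
    15289, 16818, 18500, 20350, 22385, 24623, 27086, 29794,
    32767
};
static const int8_t ima_index_adj[8] = { -1, -1, -1, -1, 2, 4, 6, 8 };

typedef struct { int predictor; int index; } ImaState;

static int16_t ima_decode_nibble(ImaState *st, uint8_t nib) {
    int step = ima_step[st->index];
    int diff = step >> 3;              /* implicit step/8 rounding term */
    if (nib & 1) diff += step >> 2;
    if (nib & 2) diff += step >> 1;
    if (nib & 4) diff += step;
    if (nib & 8) diff = -diff;
    st->predictor += diff;
    if (st->predictor >  32767) st->predictor =  32767;
    if (st->predictor < -32768) st->predictor = -32768;
    st->index += ima_index_adj[nib & 7];
    if (st->index < 0)  st->index = 0;
    if (st->index > 88) st->index = 88;
    return (int16_t)st->predictor;
}
```

    At 4 bits/sample, 8 kHz mono comes to 8000 x 4 = 32 kbps, versus
    11025 x 8 = ~88 kbps for 11 kHz 8-bit PCM, i.e. roughly 3/8 the
    size, matching the figure above.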

    Seemingly, not much else competes well in 16-32 kbps territory.
    Haven't been able to push much lower and still get passable quality
    (8-10 kbps mostly still out of reach).


    Closest I got to something usable was, ironically, doing some
    statistical trickery in an ADPCM encoder and then using LZ + Range
    Coding on it. The idea is that the encoder can remember previously
    seen patterns and then try to fit the following audio to a previously
    seen pattern if it is a "close-enough" match as per an error
    heuristic (basically, trying to make the waveform more easily LZ
    compressible by doing something similar to LZ and replaying the prior
    pattern if it is close enough to a match).

    But, the ability to improve the LZ compression of ADPCM audio as an
    encoder feature isn't exactly a break from ADPCM. Still fell short of
    solidly breaking into "sub 10 kbps" territory.



    Well, and for asset packing:
    BT1: Tweaked ZIP
    BT2: BTPAK.
    BT3: WAD4.

    Where, BTPAK was similar to what stuff I needed to do with ZIP to make
    it "not suck" for a VFS. It had mostly used a Deflate based data
    compression scheme (mostly a variant of Deflate64 tweaked to further
    expand the size of the sliding window: Deflate had a 32K window;
    Deflate64 had 64K; I had expanded it to around 1MB IIRC, partly noting
    as Deflate's distance encoding did not use the full byte-range for
    distance codes, though IIRC did require modifying the headers, as the
    original encoding for the Huffman-coded-huffman-tables didn't allow the
    full byte range for the match distance table).

    WAD4 is a further simplification, and exists more as a logical expansion
    of the WAD2 format. WAD4 is mostly using LZ4 and RP2 compression. Wasn't
    using Deflate mostly as it was difficult to make Deflate decoding cheap
    (so it makes sense to, if using Deflate, to only use it if it can beat
    LZ4 or RP2 by some certain percentage).

    Say, for example:
    Only use Deflate if it is at least 20% smaller than LZ4 or RP2.
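    That selection rule is trivial to express; a hypothetical sketch
    (the enum names and function are mine, only the 20% margin is from
    the rule above):

```c
#include <stddef.h>

/* WAD4-style "only take Deflate if clearly smaller" rule: Deflate
 * decodes slowly, so demand at least a 20% size win over the fast
 * option (LZ4/RP2, or storing raw) before accepting it. */
enum Codec { CODEC_STORE, CODEC_LZ4_OR_RP2, CODEC_DEFLATE };

static enum Codec pick_codec(size_t raw, size_t lz_sz, size_t defl_sz) {
    size_t best_fast = (lz_sz < raw) ? lz_sz : raw;
    if (defl_sz * 5 <= best_fast * 4)   /* defl <= 0.8 * best_fast */
        return CODEC_DEFLATE;
    if (lz_sz < raw)
        return CODEC_LZ4_OR_RP2;
    return CODEC_STORE;
}
```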

    WAD4 was itself mostly to reduce overhead and complexity if compared
    with BTPAK.

    One major use-case for packing and compression being to accelerate
    loading, but this goal is defeated if the overhead of the
    decompression is higher than that of the bandwidth from an HDD or
    similar (so, say, on a desktop PC, one needs at least 300-500 MB/s
    from the LZ decoder to significantly beat the HDD; or around a GB/s
    if reading from an SSD).

    Values are lower on BJX2 Core + SDcard, but similar idea applies:
    LZ4 and RP2 can beat IO to the SDcard;
    Deflate, solidly, could not (very much slower than SDcard).


    Curiously, the claim was that Deflate/ZIP was originally considered as
    being "very fast", but then, what exactly did it replace?...


    Well, there was LZSS, but this kinda sucked in a different way:
    Injecting bytes to encode literal/match bits for next 8 items;
    If 8 items consumed, read another byte for 8 more bits.
    If bit was 0, literal byte;
    If bit was 1, read 16 bits,
    with 4 bit length and 12 bit offset.
    Match copying was modulo, which kinda sucked.

    Wouldn't imagine it being particularly slow, but granted, it sucks
    pretty hard if compared with LZ4 or similar.
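    A decoder for the scheme described above is only a few lines. This
    sketch assumes an Okumura-style 4 KB ring buffer, a +3 length bias,
    and low-bit-first flag bits; all of these conventions varied between
    real implementations.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* LZSS decode as described above: a flag byte covers the next 8 items;
 * bit 0 = literal byte, bit 1 = a 16-bit token with a 12-bit ring
 * offset and 4-bit length. Match copying is modulo the ring. */
#define LZSS_RING   4096
#define LZSS_MINLEN    3

static size_t lzss_decode(const uint8_t *src, size_t n,
                          uint8_t *dst, size_t dst_max) {
    uint8_t ring[LZSS_RING] = {0};
    size_t sp = 0, dp = 0;
    unsigned rp = 0;               /* ring write position (modulo) */
    unsigned flags = 0, nbits = 0;

    while (sp < n && dp < dst_max) {
        if (nbits == 0) { flags = src[sp++]; nbits = 8; continue; }
        if (flags & 1) {
            /* match: 16 bits = 12-bit ring offset + 4-bit length */
            if (sp + 2 > n) break;
            unsigned ofs = src[sp] | ((src[sp + 1] & 0xF0u) << 4);
            unsigned len = (src[sp + 1] & 0x0Fu) + LZSS_MINLEN;
            sp += 2;
            for (unsigned i = 0; i < len && dp < dst_max; i++) {
                uint8_t b = ring[(ofs + i) % LZSS_RING];
                dst[dp++] = b;
                ring[rp] = b; rp = (rp + 1) % LZSS_RING;
            }
        } else {
            ring[rp] = dst[dp++] = src[sp++];
            rp = (rp + 1) % LZSS_RING;
        }
        flags >>= 1; nbits--;
    }
    return dp;
}
```

    The modulo ring copy is the part that "kinda sucks": every copied
    byte pays a wrap check, where LZ4-style decoders just copy from
    earlier in the flat output buffer.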


    Then again, I guess maybe Deflate could have been used for IO
    acceleration if reading from a floppy, but the needed memory overheads
    would likely have been painful for real-mode MS-DOS.

    Then again, most 1980s era filesystems did not use native compression
    (and then things like DoubleSpace/DriveSpace were seen as slower than
    just using an uncompressed FAT filesystem).


    Well, and on modern Windows, another big time cost (and a merit of using
    a VFS) is the comparably high cost of opening and closing files.

    Well, also printing debug messages into a console is annoyingly slow (if
    one has a big "printf storm" this can itself become a bottleneck, as it
    seems the Windows console has a speed of somewhere around 200
    lines/second or something).


    So, dunno...


    ...


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Feb 8 19:37:31 2026
    From Newsgroup: comp.arch


    MitchAlsup <user5857@newsgrouper.org.invalid> posted:


    kegs@provalid.com (Kent Dickey) posted:

    In article <1770238707-5857@newsgrouper.org>,
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    kegs@provalid.com (Kent Dickey) posted:

    An advantage of hashed page tables is the TLB itself has no
    "intermediate" entries. This lets the primary TLB hold more useful
    translations. This is not much of an issue if the primary TLB is
    >= 256 entries, but it's really noticeable if the TLB is <= 32
    entries. You need to lean in to hashed page tables' advantages.

    Question: If one has a 32-E to 64-E Fully Associative L1 TLB backed
    up by a 6-cycle 1024-E 4-way L2 TLB--does the L1+L2 constitute its
    "primary" TLB (or not, with explanation)?!?

    I think the 32-entry fully associative is the primary TLB. The 1024
    entry is a secondary TLB. Hits in the primary TLB cause no stalls at
    all. Misses in the primary probably lead to stalls (but it is
    possible the 6 cycle latency can be hidden sometimes), so it costs
    performance.

    If the primary TLB (the one with no stalls) is only 4 entries, then
    we can easily calculate a pretty substantial stall penalty for most
    programs, since it will pay the 6-cycle stall a lot (and that's for
    an L2 TLB hit). So we make the L1 TLB as big as possible, but it
    cannot be made too large or it's too slow, so it ends up between 32
    and 128 entries (256 tops).

    I'm just pointing out the intermediate hierarchical entries (the page
    table entries which are just page table pointers) need to be cached
    in the TLB to make future lookups faster.

    Cached yes, in TLB no--generally cached in TableWalk Accelerator store.

    And putting them there reduces
    the size of the TLB for "useful" entries.

    Which is why you don't put 'em there.

    You can try to avoid this penalty by only putting page table pointer
    entries in the L2 TLB, but then you're just making primary TLB misses
    slower, especially since the L2 TLB lookup cannot scan all entries
    like the primary can--you have to probe each one separately.

    If you want to translate the address 0x1234_56789000 using 4KB
    pages, you can do a SINGLE primary associative TLB lookup, and look
    for:

    1) Page entry for 0x1234_56789000
    2) Page-table pointer for 0x1234_566xxxxx
    3) Page-table pointer for 0x1234_4xxxxxxx
    4) Page-table pointer for 0x120x_xxxxxxxx

    (you may need to do a second lookup to get page-table entries, since
    it's expensive to deal with multiple hits in one clock).
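    The four tags probed by that single lookup fall out of simple
    shifts; a sketch assuming a 4-level table with 4 KB pages and 9
    translation bits per level (matching the example addresses above;
    not any specific core's logic):

```c
#include <stdint.h>

/* Compute the four tags probed in one associative lookup: the leaf
 * page entry plus the page-table pointer at each level above it. */
static void tlb_probe_tags(uint64_t va, uint64_t tags[4]) {
    tags[0] = va >> 12;   /* leaf page entry                        */
    tags[1] = va >> 21;   /* PTP covering 2 MB  (0x1234_566xxxxx)   */
    tags[2] = va >> 30;   /* PTP covering 1 GB  (0x1234_4xxxxxxx)   */
    tags[3] = va >> 39;   /* PTP covering 512 GB (0x120x_xxxxxxxx)  */
}
```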

    Another reason to put them somewhere other than in the TLB. One can
    make all 4 accesses simultaneously, and start the tablewalk as far
    down the tree as possible.

    For example, Ross HyperSPARC used 6 flip-flops as a TWA. When a TLB was
    being accessed, the 6 flip flops would compare the VA, so when a TLB miss
    took place, we already knew what layer of the page tables to start. We
    used 4 flip-flops for Data TLB PTPs and 2 for code PTPs. This very tiny
    TWA removed just over 1/2 of the cycles of table walking, overall.
    Each flip-flop knew which layer in the TW it covered.

    A modern TWA might contain a cache line from "all" the top layers and
    4 cache lines from the bottom layer--and one could easily multiply
    that by 2x to capture nested paging, and still the TWA is smaller
    than a single 64-E TLB while getting rid of 70% of table walk
    accesses to L2.

    With the page-table-pointer for 0x1234_566xxxxx, you then do a single
    data cache lookup to get the page entry (say 3 more clocks). So
    putting these page-table pointers in the L1 TLB has a big benefit to
    the page walk times.

    A delicate balance, especially if L2 is less than 10 cycles.
    ------------------------
    Accounting for the overhead, it seems like almost any process will waste
    at least 7 L1 TLB entries to these intermediate page table pointer
    entries, so your 32-entry L1 TLB will generally hold no more than 25
    useful pages.

    So, you don't put 'em in TLB you put 'em in TW-Accelerator Store.

    Kent
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Feb 8 16:57:15 2026
    From Newsgroup: comp.arch

    On 2/5/2026 11:23 PM, Kent Dickey wrote:
    In article <1770238707-5857@newsgrouper.org>,
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    kegs@provalid.com (Kent Dickey) posted:

    An advantage of hashed page tables is the TLB itself has no
    "intermediate" entries. This lets the primary TLB hold more useful
    translations. This is not much of an issue if the primary TLB is
    >= 256 entries, but it's really noticeable if the TLB is <= 32
    entries. You need to lean in to hashed page tables' advantages.

    Question: If one has a 32-E to 64E Fully Associative L1 TLB backed up by
    a 6-cycle 1024-E 4-way L2 TLB--does the L1+L2 constitute its "primary"
    TLB (or not with explanation) ?!?

    I think the 32-entry fully associative is the primary TLB. The 1024 entry
    is a secondary TLB. Hits in the primary TLB cause no stalls at all.
    Misses in the primary probably lead to stalls (but it is possible the
    6 cycle latency can be hidden sometimes), so it costs performance.

    If the primary TLB (the one with no stalls) is only 4 entries, then
    we can easily calculate a pretty substantial stall penalty for most
    programs since it will pay the 6 cycle stall a lot (and that's for a L2
    TLB hit). So we make the L1 TLB as big as possible, but it cannot be
    made too large or it's too slow, so it ends up between 32 and 128
    entries (256 tops).


    Full associativity isn't worth it.

    The gains from more associativity seem to max out at around 8-way,
    and 32 pages isn't nearly enough coverage (very high TLB miss rates
    would be unavoidable).

    Better IMO to do a larger set associative TLB, maybe 8-way if one can
    afford it.

    At 256x4 (1024 total TLBEs), miss rate is acceptably low.

    At 1024x8 (8192 total TLBEs), miss rate would essentially drop to 0 in
    my test programs. Major cost though is the cost of the SRAM/BRAM.


    For the L1 TLBs, had noted that simple direct mapping can work well.
    Except for certain adverse scenarios, direct mapping seems like a
    likely strategy.
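    For reference, a set-associative TLB lookup of the kind discussed
    comes down to an index-then-compare; a sketch of the "256x4" case
    above (4 KB pages and the entry layout are my assumptions here):

```c
#include <stdint.h>

/* 256-set x 4-way TLB: the low VPN bits pick the set, then the 4 ways
 * in that set are compared in parallel (sequentially in software). */
#define TLB_SETS 256
#define TLB_WAYS 4

typedef struct { uint64_t vpn; uint64_t pte; int valid; } TlbEnt;
static TlbEnt tlb[TLB_SETS][TLB_WAYS];

static int tlb_lookup(uint64_t va, uint64_t *pte_out) {
    uint64_t vpn = va >> 12;                      /* 4 KB pages */
    unsigned set = (unsigned)(vpn & (TLB_SETS - 1));
    for (int w = 0; w < TLB_WAYS; w++) {
        if (tlb[set][w].valid && tlb[set][w].vpn == vpn) {
            *pte_out = tlb[set][w].pte;
            return 1;   /* hit */
        }
    }
    return 0;           /* miss: fall back to table walk */
}
```

    A direct-mapped L1 TLB is the TLB_WAYS == 1 case: no way loop, just
    one compare, which is what makes it attractive for timing.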

    ...


    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Mon Feb 9 11:56:28 2026
    From Newsgroup: comp.arch

    The article requires signing in with an IEEE membership. I used to
    have a student membership many years ago. I let it expire as I was
    not using it; this was before the internet was big. Every so often I
    contemplate getting one, but I am trying to keep costs down.

    It used to be easy to find articles in authors' web pages, but it's
    gotten harder and harder over the years.
    You may want to try LibGen or Anna's Archive instead, which do seem
    to have it.


    === Stefan
    --- Synchronet 3.21d-Linux NewsLink 1.2