Have inverted page tables fallen out of favor? And how much OS support
is there for them?

I am thinking about using an inverted page table instead of a
hierarchical page table for the current project. The advantage is that
the entire page table should be able to fit into BRAM on the FPGA. It
should then be possible to eliminate the TLB from the design at perhaps
a small performance cost, and it would reduce the LUT cost of the MMU.
Translations could work directly from the page table.

The hash function would select a page table group, averaging about 1.2
lookups per memory access. The hash is the ASID XORed with address bits
20 to 29. There are 32k entries in the table.
Using 64kB pages to keep the page table small. 1GB RAM to manage.
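As a rough software model of the indexing described above (a sketch, not
the RTL; the cluster count and exact bit widths are assumptions the post
leaves open):

```python
# Toy model of the lookup hash described above. Assumptions: 64kB pages
# (16-bit page offset), a 32k-entry table divided into 4096 clusters of
# 8 entries, and the group index formed by XORing the ASID with virtual
# address bits 20..29. These widths are illustrative guesses.

PAGE_SHIFT = 16          # 64kB pages
CLUSTERS = 4096          # 32k entries / 8 entries per cluster

def cluster_index(asid: int, vaddr: int) -> int:
    """Pick a page-table group (cluster) for a virtual address."""
    va_bits = (vaddr >> 20) & 0x3FF      # address bits 20..29
    return (asid ^ va_bits) % CLUSTERS
```

Note that two accesses anywhere inside the same 64kB page produce the
same cluster index, since bits below 16 never reach the hash.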
Robert Finch <robfi680@gmail.com> posted:
On 2026-01-30 6:16 a.m., Dan Cross wrote:
In article <10lfuka$1d2c4$1@dont-email.me>,
Robert Finch <robfi680@gmail.com> wrote:
Have inverted page tables fallen out of favor? And how much OS support
is there for them?

IVTs are a terrible idea. Consider the challenge of using them
to share memory between virtual address spaces in a
multiprocessing system, efficiently.

- Dan C.

I may have used misleading terminology; the IVT I refer to is a hash
table based one. I tend to think of them as the same thing. I do not
think anybody would use a plain IVT.
<response to earlier comments>
Is the entire VAS covered by the hierarchical page table system?
With the entire PAS covered by the page table in BRAM it can be walked
in hardware very quickly, one cycle per step as opposed to walking the
page table in DRAM which could be quite slow.
Process switch is handled by including an ASID in the mapping, as for a TLB.
For the IVT implementation the table is twice the size needed to cover
the PAS to allow for shared memory pages.
The table just stores VPN -> PPN mappings, so I am not quite following
the challenge of using them for shared memory. Multiple VPNs mapping to
the same PPN are possible. Is it the freeing up of physical pages in SW
that causes an issue?
Consider mmap(file) in multiple different processes at different VAs.
So, now one has multiple VA pages pointing at one Physical page.
I seem to recall at least one fellow advocating the limited use of
shared memory, using replication instead of shared libraries for instance.
A hierarchical page table is a better solution, but I was looking for
something lower cost. My monster 2-level TLB is currently about 9000
LUTs (I have been working to reduce this). The IVT is about 900 LUTs.
On 2026-01-30 3:35 p.m., MitchAlsup wrote:
??? I think a hash table has this characteristic. Multiple VA pages can
point to a single physical page using a hash table. Is it just a
performance issue? The ASID is part of the hash/compare.
I guess I should take a look at the mmap() code.
The hash table is only slightly different from a giant TLB. On a miss
the table is walked instead of walking main memory. Page faults would be
handled the same way.
The table is clustered: the hash points to eight entries which are then
searched in parallel for the correct one. If the entry is not found, a
table walk begins using quadratic open addressing.
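The clustered lookup and the quadratic fallback can be sketched like
this (table geometry and the probe limit are illustrative assumptions;
the eight-way compare is a loop here but would be parallel in hardware):

```python
# Sketch of the clustered inverted-table lookup described above: the hash
# selects a group of eight entries checked "in parallel", and a miss
# falls back to quadratic probing over further groups.

CLUSTERS = 4096
WAYS = 8                 # entries per cluster
MAX_PROBES = 30          # illustrative walk limit before faulting

def lookup(table, asid, vpn):
    """Return the PPN for (asid, vpn), or None -> would raise a fault."""
    h = (asid ^ vpn) % CLUSTERS
    for step in range(MAX_PROBES):
        group = (h + step * step) % CLUSTERS   # quadratic open addressing
        for entry in table[group]:             # 8-way parallel compare in HW
            if entry and entry[0] == (asid, vpn):
                return entry[1]                # the PPN
    return None

def insert(table, asid, vpn, ppn):
    """Place a mapping in the first free slot along the probe sequence."""
    h = (asid ^ vpn) % CLUSTERS
    for step in range(MAX_PROBES):
        group = (h + step * step) % CLUSTERS
        for i, entry in enumerate(table[group]):
            if entry is None:
                table[group][i] = ((asid, vpn), ppn)
                return True
    return False                               # no free slot: fault
```

With 30 probe steps of 8 entries each, a translation can land in any of
240 slots, which matches the "240 possible locations" figure later in
the thread.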
On 1/30/2026 3:03 PM, Robert Finch wrote:
It isn't obvious (at least to me) how what you are describing is much different from a SW managed TLB.
The main difference, I think, is that I ended up mostly settling on
modulo addressing rather than hashing, but the reasoning here is subtle:
while on average hashing is better than modulo addressing, the worst
case behavior for hashing is worse than the worst case for modulo.
Effectively, hashing requires twice the associativity to not shoot
itself in the foot in the worst case, whereas modulo is merely slightly
worse in the average case.
Ironically, this also means that while small-scale alignment is often
preferable, medium and large scale alignment is best minimized (or, IOW,
ASLR can actually be beneficial for performance by reducing unrelated
data stomping the same locations in the caches and TLB).
Though, there are intermediate options, say, for a TLB, XOR'ing the
index with the ASID for non-global pages, *1; but then this means
needing to partition the TLB between global and local (halving
associativity relative to LUT budget), in turn still favoring the use of modulo addressing.
Granted, a hashing scheme and multiple probing steps could be possible
to increase effective associativity at the cost of latency (say, in an extreme case, using latency tricks to cause a 2-way TLB to pretend to be
4 or 8 way).
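The multi-probe idea above can be illustrated as follows (the index and
rehash choices are illustrative assumptions, not the actual design): a
2-way set-associative TLB probed twice, so a translation may live in any
of four entries at the cost of an extra probe.

```python
# Illustration of the "2-way TLB pretending to be 4-way" idea: probe a
# primary set, then a rehashed secondary set. Index width and the rehash
# function are illustrative assumptions.

SETS = 64
WAYS = 2

def probe_indices(vpn):
    primary = vpn % SETS
    secondary = (primary ^ (vpn >> 6)) % SETS   # cheap rehash for pass 2
    return [primary, secondary]

def tlb_lookup(tlb, vpn):
    """Return the PPN or None; 2 probes x 2 ways = pseudo-4-way."""
    for idx in probe_indices(vpn):
        for tag, ppn in tlb[idx]:               # 2-way compare per probe
            if tag == vpn:
                return ppn
    return None
```

The extra probe trades latency for effective associativity, which is the
same trade the ringbus variant below has to accommodate.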
Though, with something like my ringbus design this would require getting
clever to make it work (as it could not translate requests on multiple
adjacent slots). It would likely need, say, a small 1-way TLB on the
front-end, followed by 2 or 3 additional slots to resolve the lookup
sequentially if the 1-way TLB misses (where multiple subsequent requests
that miss on the 1-way TLB and were shadowed by a prior request taking
up the "multi-way probe" resources would need to cycle back around the
ring for another pass through the TLB, *2).
*1: Or, semi-global in an MMU that lacks true global pages, as the
existence of true global pages is actively detrimental in some scenarios
(it is preferable instead to divide ASIDs into groups, where a global
page is only global within a group or within related groups, allowing
ASIDs to exist which will not see these "global" pages).
*2: And, while the ringbus continues to seem high-latency and stupid,
still seems to be one of the better options to balance performance and cheapness (simpler bus designs being painfully slow, and more advanced FIFO/queue based buses being far more complex and expensive...).
On 2026-01-30 4:44 p.m., BGB wrote:
On 1/30/2026 3:03 PM, Robert Finch wrote:
It isn't obvious (at least to me) how what you are describing is much different from a SW managed TLB.
It is not a lot different. On a translation miss the table is walked in
HW, though, instead of SW walking main memory. There are enough entries
in the table to cover the physical address range. Only page faults are
handled in SW, which should be rare. A page fault is generated after 30
steps in the walk, though. That is a miss on 240 possible locations. Is
240-way associativity not enough?
The page fault handler will likely just report 'out-of-memory' then
quit the app.
I think clustering gives the table more associativity. Up to eight VAs
can all hash to the same cluster, almost as if it were eight-way
associative.
I think clustering is key to using an inverted hash table. I got this
from studying the PowerPC.
Not too worried about latency as long as it is reasonable. Every step
adds another eight ways of associativity due to clustering.
Robert Finch wrote:
??? I think a hash table has this characteristic. Multiple VA pages
can point to a single physical page using a hash table. Is it just a
performance issue? ASID is part of the hash/compare.
Be careful about ASIDs - they aren't just part of the virtual address
in MMUs like x86. There is a logic aspect in the address compare.
If the PTE Global flag is set then ASID is NOT included in the VA compare,
if the PTE Global flag is clear then ASID is included in the VA compare.
If the G flag is set on OS kernel pages then the effect is to make the
kernel virtual pages one single address space on all cores with no
associated ASID, while process virtual spaces reside in their own ASID's.
The above is easy to implement in a CAM because the Global flag is
part of the PTE entry. Compare the virtual page number (VPN) and if
there is a PTE match, then if the PTE.G flag is set take ASID as matched, otherwise check if ASID is matched.
But if you just tack the ASID onto the VA to get a larger VA and
use that as an index then you don't get the same logic because
the PTE.G flag that tells you whether ASID matters for that PTE,
is on the PTE that you are looking up.
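The compare being described can be sketched directly (field names are
illustrative): the Global flag lives in the PTE itself, so it gates the
ASID compare after the VPN match and cannot be folded into an index
computed before the PTE has been read.

```python
# Sketch of the CAM compare above: VPN must match, and the ASID matters
# only when the PTE's Global flag is clear. Field names are illustrative.

from dataclasses import dataclass

@dataclass
class PTE:
    vpn: int
    asid: int
    g: bool      # Global flag: ignore ASID on compare when set
    ppn: int

def cam_match(pte: PTE, vpn: int, asid: int) -> bool:
    """VPN compare always; ASID compare only for non-global PTEs."""
    return pte.vpn == vpn and (pte.g or pte.asid == asid)

kernel = PTE(vpn=0x80, asid=0, g=True, ppn=0x10)   # one space on all cores
user = PTE(vpn=0x40, asid=5, g=False, ppn=0x20)    # private to ASID 5
```

A hashed table that XORs the ASID into the index has already committed
to a location before this `pte.g` test can run, which is exactly the
problem being pointed out.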
I guess I should take a look at the mmap() code.
The hash table is only slightly different than a giant TLB. The TLB is
walked on a miss instead of walking main memory. Page faults would be
handled the same way.
Yes.
Long ago I read a paper about a port of Linux to a PowerPC with its
IVT. Linux memory management is based on a hierarchical page table (HPT)
so rather than rewrite the memory manager they treated the IVT as a
large TLB. As pages were touched they copied the PTEs from the HPT into
the IVT. When there was a process switch they purged the IVT of all
entries the same way one might purge a TLB. Later they added ASIDs so
they didn't have to purge the IVT on address space switch.
The code that copied the PTEs from HPT to IVT was very similar to a SW
managed TLB, as Virtual Machines also do if they don't have nested page
tables.
Also you can't guarantee that your whole program will fit into a single
inverted/hash table of a predetermined size, because you don't know in
advance which entries will collide and push other entries out.
Even if it does fit today, any change to the program could break that.
A hierarchical page table is a better solution, but I was looking for
something lower cost. My monster 2-level TLB is currently about 9000
LUTs (I have been working to reduce this). The IVT is about 900 LUTs.
What is your MMU currently doing that uses 9000 LUTs?
On 2026-01-31 12:31 p.m., EricP wrote:
Long ago I read a paper about a port of Linux to a PowerPC with its IVT.

I am not sure about getting Linux working. I may just use my own OS.
Skimming through the mmap() code has given me an idea of just how
complex the software is.

Also you can't guarantee that your whole program will fit into a single
inverted/hash table of a predetermined size because you don't know in
advance which entries will collide and push other entries out.
Even if it does fit today, any change to the program could break that.

Collisions will not push other entries out; they just bounce to a
different empty entry. If one cannot be found then it is an
out-of-memory.

Testing some. Testing just allocates pages without freeing them, so it
is a bit unrealistic.
The IVT stopped after 63 bounces at 69% full. (A max of 64 bounces is
set.) Average number of bounces: 1.8 per lookup. This was with random
ASIDs and addresses.
The table can be clocked at twice the CPU rate.

I have been wondering how to collect and use stats to control things.
The entries in the IVT for ASIDs no longer in use could be cleared
automatically by a background HW process.
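The kind of fill test described can be reproduced with a small Monte
Carlo sketch: insert random (ASID, VPN) keys with quadratic bouncing
until an insert exhausts the bounce limit, then report the fill ratio.
The geometry here is scaled down to keep it quick, and the numbers will
vary with the random seed, so treat the output as indicative only.

```python
# Monte Carlo sketch of the fill test above. Scaled-down geometry and
# the hash are illustrative assumptions; results depend on the RNG.

import random

CLUSTERS, WAYS, MAX_BOUNCES = 512, 8, 64

def fill_until_failure(seed=1):
    """Insert random keys until one exceeds the bounce limit; return
    the table's fill ratio at that point."""
    rng = random.Random(seed)
    table = [[None] * WAYS for _ in range(CLUSTERS)]
    used = 0
    while True:
        key = (rng.getrandbits(16), rng.getrandbits(10))   # (ASID, VPN)
        h = (key[0] ^ key[1]) % CLUSTERS
        for step in range(MAX_BOUNCES):
            group = table[(h + step * step) % CLUSTERS]    # quadratic bounce
            slot = next((i for i, e in enumerate(group) if e is None), None)
            if slot is not None:
                group[slot] = key
                used += 1
                break
        else:
            return used / (CLUSTERS * WAYS)   # fill ratio at first failure

ratio = fill_until_failure()
```

Allocating without freeing, as noted, is the worst case for this kind of
test; a real workload recycling pages would fail later, if at all.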
Robert Finch <robfi680@gmail.com> posted:
Not too worried about latency as long as it is reasonable. Every step is
another eight ways associative due to clustering.
Be careful !!
Matrix300 (SPEC89) uses all 4 transposes of DGEMM. The 4th transpose
takes a TLB miss every 4th load, or once every 2 cycles on the 6-wide
machine. A 32-entry fully associative TLB would miss every other cycle,
while a 256-entry direct mapped TLB would never take a miss.
Robert Finch wrote:
On 2026-01-31 12:31 p.m., EricP wrote:
Robert Finch wrote:I am not sure about getting Linux working. I may just use my own OS.
The hash table is only slightly different than a giant TLB. The TLB
is walked on a miss instead of walking main memory. Page faults
would be handled the same way.
Yes.
Long ago I read a paper about a port of Linux to a PowerPC with its IVT. >>> Linux memory management is based on a hierarchical page table (HPT) so
rather that rewrite the memory manager they treated the IVT as a
large TLB.
As pages were touched they copied the PTE's from the HPT into IVT.
When there was a process switch they purged the IVT of all entries
the same way one might purge a TLB. Later they added ASID's so they
didn't have to purge IVT on address space switch.
The code that copied the PTE's from HPT to IVT was very similar
to a SW managed TLB, as VM Virtual Machines also do if they don't
have nested page tables.
Skimming through the mmap() code has given me an idea of just how
complex the software is.
It was just an example of how one can use a page table to emulate
any other kind of page table.
Also you can't guarantee that your whole program will fit into a single
inverted/hash table of a predetermined size because you don't know in
advance which entries will collide and push other entries out.
Even if it does fit today, any change to the program could break that.
Collisions will not push other entries out, they just bounce to a
different empty entry. If one cannot be found then it is an
out-of-memory.
Testing some. Testing just allocates pages without freeing them so it
is a bit unrealistic.
The IVT stopped after 63 bounces at 69% full. (A max of 64 bounces is
set). Average number of bounces: 1.8 per lookup. This was with random
ASIDs and addresses.
The table can be clocked at twice the CPU rate.
One might make the hash function controllable so you can optimize the
bounce length for different programs. Also the max bounce limit should
be a dynamic HW register and not hard wired.
I have been wondering how to collect and use stats to control things.
The entries in the IVT for ASIDs no longer in use could be cleared
automatically by a background HW process.
That implies some kind of ASID allocator to track which are in-use.
It also implies a whole IVT scan as each ASID is freed,
or maintain a pending free list and a scavenger process.
Simpler to have create process assign a new ASID by incrementing
an unsigned up counter that starts at 1. When you hit its max then reset
to 1 and clear the whole IVT for all non-zero ASID entries in one pass.
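EricP's wrapping-counter scheme above can be sketched in a few lines of C. This is only an illustration: ASID_MAX, flush_ivt_nonzero(), and the 12-bit width are assumptions, not anything from the actual design.

```c
#include <stdint.h>

#define ASID_MAX 4095u           /* assume 12-bit ASIDs, 0 reserved for OS */

static uint32_t next_asid   = 1; /* counter starts at 1 */
static uint32_t flush_count = 0; /* number of full-IVT purges so far */

/* Hypothetical hook: purge all non-zero-ASID entries from the IVT. */
static void flush_ivt_nonzero(void) { flush_count++; }

uint32_t alloc_asid(void)
{
    uint32_t a = next_asid++;
    if (next_asid > ASID_MAX) {  /* hit the max: wrap to 1 and purge once */
        next_asid = 1;
        flush_ivt_nonzero();
    }
    return a;
}
```

The point of the scheme is that the expensive purge happens only once per ASID_MAX process creations, which is why more ASID bits make it cheaper.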
EricP wrote:
Robert Finch wrote:
??? I think a hash table has this characteristic. Multiple VA pages
can point to a single physical page using a hash table. Is it just a
performance issue? ASID is part of the hash/compare.
Be careful about ASID's - they aren't just part of the virtual address
in MMU's like x86. There is a logic aspect in the address compare.
If the PTE Global flag is set then ASID is NOT included in the VA compare, if the PTE Global flag is clear then ASID is included in the VA compare.
If the G flag is set on OS kernel pages then the effect is to make the kernel virtual pages one single address space on all cores with no associated ASID, while process virtual spaces reside in their own ASID's.
The above is easy to implement in a CAM because the Global flag is
part of the PTE entry. Compare the virtual page number (VPN) and if
there is a PTE match, then if the PTE.G flag is set take ASID as matched, otherwise check if ASID is matched.
But if you just tack the ASID onto the VA to get a larger VA and
use that as an index then you don't get the same logic because
the PTE.G flag that tells you whether ASID matters for that PTE,
is on the PTE that you are looking up.
One could achieve the same effect as the CAM approach by picking off the range of addresses assigned to the OS and assigning them to ASID = 0,
and having the process range use the current non-zero ASID register.
If we have a 64-bit VA with 4kB pages = 52-bit virtual page number
and we added a 12-bit ASID then that gives a 76-bit virtual space
with a 64-bit AS-VPN to look up. We reserve ASID=0 for the OS,
1..4095 are for processes.
The MMU has a register MMU_OS_START that hold the OS unsigned lower
start address and a comparator. If the VA is >= MMU_OS_START then
it uses an ASID = 0, otherwise it comes from the MMU_ASID register.
This 12-bit value is concatenated with the 52-bit VPN to give a
64-bit combined ASID-VPN index value to lookup in the TLB.
The OS software that assigns ASIDs also skips assigning 0 to any process.
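EricP's MMU_OS_START comparator and the 12-bit ASID concatenated with the 52-bit VPN can be sketched directly. The register names come from the post above; the C encoding and the sample register values are assumptions for illustration only.

```c
#include <stdint.h>

/* Models of the two MMU registers described above (sample values). */
static uint64_t mmu_os_start = 0xFFFF800000000000ull; /* assumed OS base */
static uint32_t mmu_asid     = 42;                    /* current process */

/* 64-bit VA with 4kB pages -> 52-bit VPN; a 12-bit ASID on top gives
   the 64-bit combined ASID-VPN index to look up in the TLB. */
uint64_t asid_vpn_index(uint64_t va)
{
    uint64_t vpn  = va >> 12;                    /* 52-bit virtual page number */
    uint64_t asid = (va >= mmu_os_start)
                  ? 0                            /* OS range: forced ASID = 0 */
                  : (mmu_asid & 0xFFFu);         /* else current 12-bit ASID */
    return (asid << 52) | vpn;
}
```

An OS-range VA thus always yields an index in the ASID-0 slice, giving the same "one kernel space shared by all" effect as the PTE.G flag in a CAM.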
In article <10lji2u$2haq2$1@dont-email.me>,
Robert Finch <robfi680@gmail.com> wrote:
On 2026-01-30 4:44 p.m., BGB wrote:
On 1/30/2026 3:03 PM, Robert Finch wrote:
[ Proposal is a hashed page table, with 2x the entries to cover the entire physical address space, with each hash entry having 8 pages which need to
be searched before moving to the next entry].
It isn't obvious (at least to me) how what you are describing is much
different from a SW managed TLB.
It is not a lot different. On a translation miss the table is walked
with HW though instead of SW walking main memory. There are enough
entries in the table to cover the physical address range. Only page
faults are handled in SW which should be rare. A page fault is generated
after 30 steps in the walk though. That is a miss on 240 possible
locations. 240-way associativity is not enough?
The page fault handler will likely just report 'out-of-memory' then quit
the app.
I'm going to guess your physical address space is just 32-bits, otherwise this table is huge. I think you said you have 64KB pages (but then you said 256KB pages--which is just too big, so I'll ignore that), which for a
44-bit physical address space would mean 2^28 entries, with each entry
being 128-bits, for a 4GB page table. But for 32-bit, it's just 2^16
entries of 64-bits each.
A page table which doesn't scale is problematic. Especially if it needs
OS support. Who wants to invest effort in a dead-end idea?
Your hashed page table has issues which I think you should address:
1) Supporting multiple page sizes is a pain. If you have 2MB pages and
64KB pages, there are 32 64KB pages for each 2MB page. So do you
lazily load the hashtable only on misses, or load all 32 aliases
at once?
2) It's common to just do a single hash lookup, then trap to software. Doing
probing is a big pain. The problems come from removing entries.
If you page out a page, it leaves gaps in the table. When you do
a hardware lookup, and find no valid pages at the first entry, this
doesn't tell you anything! The obvious solution is ALL hardware
lookups must do the maximum searching before declaring a hash-table
miss--which you say is 30 steps, checking 240 entries. I'm not sure
there's a better solution than that, but this is a pretty big
penalty for a TLB miss.
3) It's very complex to efficiently update the hash table. On inserting
a new entry, where do you put it? On one hand, a "move-to-front"
strategy seems like a good idea--newly added pages seem like they
should get lookup priority. But how do you do this? It looks like
each software handled miss has a TON of work to do. It seems like
you need to re-hash each entry you consider moving, to make sure it
stays on its proper hash chain. And keeping the OS page structures
in sync looks like even more work.
4) The global nature of the hash table means multiple threads probably need
a lock to change it. So misses to software have to get a lock before
making any hashtable changes. This is very expensive. This is
especially bad when a process ends--all of its translations need to
be removed, and so all page table misses are blocked while that
happens. This is bad.
A simpler approach is a per-process hash table of variable size in memory, sized to at least 2X the number of pages being used.
It's indexed by a hash of ASID+VA, and does a single probe. It might be useful to do two hash tables: one for large pages, and one for small pages (with 2 lookups).
On any miss, the hash entry to be updated is easy to find for software.
But: there's effectively no associativity--but just increase the hash table's size if software detects that you are getting conflicts. Pick a good
hash function to avoid stupid conflicts. I recommend you make the hash function an instruction, so software can calculate it easily (so if you reverse bits and do other stuff that may be hard for software to calculate, you make it easy).
This gives you the benefits of a hardware table walker with less trouble.
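Kent's per-process single-probe table hinges on software being able to compute exactly the hash the hardware uses (hence the suggestion of exposing it as an instruction). A minimal sketch of such a software-computable hash, with an illustrative mixing function standing in for whatever the hardware would actually implement:

```c
#include <stdint.h>

/* Illustrative stand-in for a hardware "hash" instruction: mix the
   ASID and VPN, then reduce to a slot index. The multiply constant is
   the usual Fibonacci-hashing one; the real function is an assumption. */
uint64_t hpt_hash(uint64_t asid, uint64_t vpn, uint64_t table_slots)
{
    uint64_t x = (asid << 52) ^ vpn;   /* combine ASID and VPN */
    x *= 0x9E3779B97F4A7C15ull;        /* Fibonacci-style mixing */
    x ^= x >> 29;                      /* fold high bits back down */
    return x % table_slots;            /* table sized >= 2x pages in use */
}
```

Because the function is plain arithmetic, the OS can find the one slot a translation must occupy without probing, which is what makes inserts and removals cheap in this scheme.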
An advantage of hashed page tables is the TLB itself has no "intermediate" entries. This lets the primary TLB hold more useful translations. This
is not much of an issue if the primary TLB is >= 256 entries, but it's
really noticeable if the TLB is <= 32 entries. You need to lean in to hashed page tables advantages.
Kent
--- Synchronet 3.21b-Linux NewsLink 1.2
On 2026-02-01 10:28 a.m., Kent Dickey wrote:
In article <10lji2u$2haq2$1@dont-email.me>,
Robert Finch <robfi680@gmail.com> wrote:
On 2026-01-30 4:44 p.m., BGB wrote:
On 1/30/2026 3:03 PM, Robert Finch wrote:
[ Proposal is a hashed page table, with 2x the entries to cover the entire physical address space, with each hash entry having 8 pages which need to be searched before moving to the next entry].
It isn't obvious (at least to me) how what you are describing is much
different from a SW managed TLB.
It is not a lot different. On a translation miss the table is walked
with HW though instead of SW walking main memory. There are enough
entries in the table to cover the physical address range. Only page
faults are handled in SW which should be rare. A page fault is generated
after 30 steps in the walk though. That is a miss on 240 possible
locations. 240-way associativity is not enough?
The page fault handler will likely just report 'out-of-memory' then quit
the app.
I'm going to guess your physical address space is just 32-bits, otherwise
Yes, this is for a 29-bit DRAM address space. I was going to use 2^13 entries with 256kB (or possibly smaller) pages. It is only usable with a smaller address space that can fit into BRAM.
this table is huge. I think you said you have 64KB pages (but then you said
256KB pages--which is just too big, so I'll ignore that), which for a 44-bit physical address space would mean 2^28 entries, with each entry being 128-bits, for a 4GB page table. But for 32-bit, it's just 2^16 entries of 64-bits each.
A page table which doesn't scale is problematic. Especially if it needs
OS support. Who wants to invest effort in a dead-end idea?
Your hashed page table has issues which I think you should address:
1) Supporting multiple page sizes is a pain. If you have 2MB pages and
64KB pages, there are 32 64KB pages for each 2MB page. So do you
lazily load the hashtable only on misses, or load all 32 aliases
at once?
Supporting multiple page sizes is a pain even for a hierarchical table.
Multiple TLBs are used, so maybe multiple hash tables could be used. But
I was not planning on supporting more than one page size.
2) It's common to just do a single hash lookup, then trap to software. Doing
probing is a big pain. The problems come from removing entries.
If you page out a page, it leaves gaps in the table. When you do
a hardware lookup, and find no valid pages at the first entry, this
doesn't tell you anything! The obvious solution is ALL hardware
Thanks, I had forgotten the issue with removes. Inserts are easy without removes. I wonder if the removes could be done in bulk similar to a
garbage collection. Just mark the entry as deleted rather than leaving a gap, then have a process that periodically rebuilds the table. For the
table size I am planning on using garbage collecting it may not be too bad.
The table probes until it finds the entry, or an empty entry in a group where the lookup should be. It stops after a certain number of probes if there are no empty entries in the group.
The lookup returns the group and the group empty status at which there
is an empty entry if the lookup is not found. So, adding a new entry
should be straightforward.
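The probe loop described above (search a group, stop early if the group has an empty slot, give up after a fixed number of probes, and report the insertion group on a miss) can be sketched as follows. Table sizes, the 8-way group width, and the advance-to-next-group rehash step are illustrative assumptions, not the project's actual parameters.

```c
#include <stdint.h>
#include <stdbool.h>

#define NGROUPS    1024          /* hash groups (illustrative size) */
#define WAYS       8             /* entries searched per group */
#define MAX_PROBES 30            /* give up (treat as fault) after this */

typedef struct { uint64_t key; uint64_t ppn; bool valid; } pte_t;
static pte_t table[NGROUPS][WAYS];

static uint32_t hash_group(uint64_t key)
{
    return (uint32_t)((key * 0x9E3779B97F4A7C15ull) >> 54) % NGROUPS;
}

/* Returns true on a hit (*ppn set). On a miss, *ins_group is the first
   probed group holding an empty slot (where an insert should go), or
   NGROUPS if no empty slot was found within MAX_PROBES groups. */
bool ivt_lookup(uint64_t key, uint64_t *ppn, uint32_t *ins_group)
{
    uint32_t g = hash_group(key);
    *ins_group = NGROUPS;
    for (int probe = 0; probe < MAX_PROBES; probe++) {
        bool has_empty = false;
        for (int w = 0; w < WAYS; w++) {
            if (table[g][w].valid && table[g][w].key == key) {
                *ppn = table[g][w].ppn;
                return true;
            }
            if (!table[g][w].valid)
                has_empty = true;
        }
        if (has_empty) {          /* empty slot: key cannot be further on */
            *ins_group = g;
            return false;
        }
        g = (g + 1) % NGROUPS;    /* group full: bounce to the next group */
    }
    return false;                 /* probes exhausted: page fault */
}
```

The early stop at an empty slot is what keeps the average probe count low, and it is also why removals are awkward: deleting an entry creates an empty slot that can falsely terminate later lookups, hence the tombstone/garbage-collection discussion above.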
lookups must do the maximum searching before declaring a hash-table
miss--which you say is 30 steps, checking 240 entries. I'm not sure
there's a better solution than that, but this is a pretty big
penalty for a TLB miss.
30 steps (in HW) is about as fast as a divide operation. It only works
for BRAM; if it were DRAM it would be many clocks per lookup.
3) It's very complex to efficiently update the hash table. On inserting
a new entry, where do you put it? On one hand, a "move-to-front"
strategy seems like a good idea--newly added pages seem like they
should get lookup priority. But how do you do this? It looks like
each software handled miss has a TON of work to do. It seems like
I was not planning on anything that complex. The miss returns the group
at which the entry should be located, so SW should know where to insert.
For testing, the test bench fakes inserting pages. The issues rise when
the table gets full.
you need to re-hash each entry you consider moving, to make sure it
stays on its proper hash chain. And keeping the OS page structures
in sync looks like even more work.
4) The global nature of the hash table means multiple threads probably need
a lock to change it. So misses to software have to get a lock before
making any hashtable changes. This is very expensive. This is
especially bad when a process ends--all of its translations need to
be removed, and so all page table misses are blocked while that
happens. This is bad.
Definitely an issue. May have only the system (one) thread able to
update it.
A simpler approach is a per-process hash table of variable size in memory, sized to at least 2X the number of pages being used. It's indexed by a hash
of ASID+VA, and does a single probe. It might be useful to do two hash tables: one for large pages, and one for small pages (with 2 lookups).
On any miss, the hash entry to be updated is easy to find for software. But: there's effectively no associativity--but just increase the hash table's
size if software detects that you are getting conflicts. Pick a good
hash function to avoid stupid conflicts. I recommend you make the hash function an instruction, so software can calculate it easily (so if you reverse bits and do other stuff that may be hard for software to calculate, you make it easy).
This gives you the benefits of a hardware table walker with less trouble.
An advantage of hashed page tables is the TLB itself has no "intermediate" entries. This lets the primary TLB hold more useful translations. This
is not much of an issue if the primary TLB is >= 256 entries, but it's really noticeable if the TLB is <= 32 entries. You need to lean in to hashed
page tables advantages.
Kent
On 2026-02-01 12:49 p.m., EricP wrote:
Simpler to have create process assign a new ASID by incrementing
an unsigned up counter that starts at 1. When you hit its max then reset
to 1 and clear the whole IVT for all non-zero ASID entries in one pass.
A garbage collection approach may work for removed entries. It might be
a complicated state machine though to do in HW. It could look for
invalid or entries with defunct ASIDs.
kegs@provalid.com (Kent Dickey) posted:
An advantage of hashed page tables is the TLB itself has no "intermediate"
entries. This lets the primary TLB hold more useful translations. This
is not much of an issue if the primary TLB is >= 256 entries, but it's
really noticeable if the TLB is <= 32 entries. You need to lean in to hashed
page tables advantages.
Question: If one has a 32-E to 64E Fully Associative L1 TLB backed up by
a 6-cycle 1024-E 4-way L2 TLB--does the L1+L2 constitute its "primary"
TLB (or not with explanation) ?!?
Robert Finch <robfi680@gmail.com> writes:
On 2026-02-01 12:49 p.m., EricP wrote:
Simpler to have create process assign a new ASID by incrementing
an unsigned up counter that starts at 1. When you hit its max then reset
to 1 and clear the whole IVT for all non-zero ASID entries in one pass.
A garbage collection approach may work for removed entries. It might be
a complicated state machine though to do in HW. It could look for
invalid or entries with defunct ASIDs.
AFAIK the approach outlined by EricP has been used in commercial
hardware. It's simple, and if there are enough ASIDs, cheap (because
resets are rare). So better add more bits to the ASIDs than introduce complicated machinery for reclaiming ASIDs.
- anton
EricP <ThatWouldBeTelling@thevillage.com> posted:
EricP wrote:
Robert Finch wrote:
??? I think a hash table has this characteristic. Multiple VA pages
can point to a single physical page using a hash table. Is it just a
performance issue? ASID is part of the hash/compare.
Be careful about ASID's - they aren't just part of the virtual address
in MMU's like x86. There is a logic aspect in the address compare.
If the PTE Global flag is set then ASID is NOT included in the VA compare,
if the PTE Global flag is clear then ASID is included in the VA compare.
If the G flag is set on OS kernel pages then the effect is to make the
kernel virtual pages one single address space on all cores with no
associated ASID, while process virtual spaces reside in their own ASID's.
The above is easy to implement in a CAM because the Global flag is
part of the PTE entry. Compare the virtual page number (VPN) and if
there is a PTE match, then if the PTE.G flag is set take ASID as matched,
otherwise check if ASID is matched.
But if you just tack the ASID onto the VA to get a larger VA and
use that as an index then you don't get the same logic because
the PTE.G flag that tells you whether ASID matters for that PTE,
is on the PTE that you are looking up.
One could achieve the same effect as the CAM approach by picking off the
range of addresses assigned to the OS and assigning them to ASID = 0,
and having the process range use the current non-zero ASID register.
If we have a 64-bit VA with 4kB pages = 52-bit virtual page number
and we added a 12-bit ASID then that gives a 76-bit virtual space
with a 64-bit AS-VPN to look up. We reserve ASID=0 for the OS,
1..4095 are for processes.
SPARC V8 had 13-bit ASIDs and found that this was way too short for
the market SUN was targeting. SPARC V9 went with 16-bit ASIDs and
I think this is as SHORT as an ASID can be for "reasonable" future
proofing.
In article <1770238707-5857@newsgrouper.org>,
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
kegs@provalid.com (Kent Dickey) posted:
An advantage of hashed page tables is the TLB itself has no "intermediate"
entries. This lets the primary TLB hold more useful translations. This
is not much of an issue if the primary TLB is >= 256 entries, but it's
really noticeable if the TLB is <= 32 entries. You need to lean in to hashed
page tables advantages.
Question: If one has a 32-E to 64E Fully Associative L1 TLB backed up by
a 6-cycle 1024-E 4-way L2 TLB--does the L1+L2 constitute its "primary"
TLB (or not with explanation) ?!?
I think the 32-entry fully associative is the primary TLB. The 1024 entry
is a secondary TLB. Hits in the primary TLB cause no stalls at all.
Misses in the primary probably lead to stalls (but it is possible the
6 cycle latency can be hidden sometimes), so it costs performance.
If the primary TLB (the one with no stalls) is only 4 entries, then
we can easily calculate a pretty substantial stall penalty for most
programs since it will pay the 6 cycle stall a lot (and that's for a L2
TLB hit). So we make the L1 TLB as big as possible, but it cannot be made
too large or it's too slow, so it ends up between 32- and 128-entries
(256 tops).
I'm just pointing out the intermediate hierarchical entries (the page
table entries which are just page table pointers) need to be cached in
the TLB to make future lookups faster.
And putting them there reduces
the size of the TLB for "useful" entries.
You can try to avoid this
penalty by only putting page table pointer entries in the L2 TLB, but then you're just making primary TLB misses slower, especially since the L2 TLB lookup cannot scan all entries like the primary can--you have to probe
each one separately.
If you want to translate the address 0x1234_56789000 using 4KB pages, you
can do a SINGLE primary associative TLB lookup, and look for:
1) Page entry for 0x1234_56789000
2) Page-table pointer for 0x1234_566xxxxx
3) Page-table pointer for 0x1234_4xxxxxxx
4) Page-table pointer for 0x120x_xxxxxxxx
(you may need to do a second lookup to get page-table entries, since it's
expensive to deal with multiple hits in one clock).
With the page-table-pointer for 0x1234_566xxxxx, you then do a single
data cache lookup to get the page entry (say 3 more clocks). So putting these page-table pointers in the L1 TLB has a big benefit to the page
walk times.
Accounting for the overhead, it seems like almost any process will waste
at least 7 L1 TLB entries to these intermediate page table pointer
entries, so your 32-entry L1 TLB will generally hold no more than 25
useful pages.
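Kent's single associative lookup that can hit either the leaf page or a cached page-table pointer at any level can be sketched as below. The entry format, the 32-entry size, and the 9-bits-per-level shifts (4-level, 4KB pages) are illustrative assumptions.

```c
#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 32

typedef struct {
    uint64_t vtag;   /* VA bits above what the level covers */
    int      level;  /* 0 = leaf 4KB page, 1..3 = page-table pointers */
    uint64_t pa;     /* translation, or physical address of next table */
    bool     valid;
} tlb_ent_t;

static tlb_ent_t tlb[TLB_ENTRIES];

static uint64_t tag_for(uint64_t va, int level)
{
    return va >> (12 + 9 * level);   /* drop the bits the level covers */
}

/* One associative pass returns the finest-grained match. A leaf hit
   ends the walk; a pointer hit lets the walker skip ahead to that
   level with a single data-cache lookup. */
tlb_ent_t *tlb_lookup(uint64_t va)
{
    tlb_ent_t *best = 0;
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (!tlb[i].valid)
            continue;
        if (tag_for(va, tlb[i].level) == tlb[i].vtag &&
            (!best || tlb[i].level < best->level))
            best = &tlb[i];
    }
    return best;
}
```

This also makes the overhead Kent describes concrete: each cached page-table pointer occupies a full entry, so a 32-entry L1 TLB holding the upper levels of a walk has correspondingly fewer slots left for leaf pages.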
Kent
In article <10lfuka$1d2c4$1@dont-email.me>,
Robert Finch <robfi680@gmail.com> wrote:
Have inverted page tables fallen out of favor? And how much OS support
is there for them?
IVTs are a terrible idea. Consider the challenge of using them
to share memory between virtual address spaces in a
multiprocessing system, efficiently.
- Dan C.
On 2026-01-30 6:16 a.m., Dan Cross wrote:
In article <10lfuka$1d2c4$1@dont-email.me>,
Robert Finch <robfi680@gmail.com> wrote:
Have inverted page tables fallen out of favor? And how much OS support
is there for them?
IVTs are a terrible idea. Consider the challenge of using them
to share memory between virtual address spaces in a
multiprocessing system, efficiently.
- Dan C.
I may have used misleading terminology, the IVT I refer to is a hash
table based one. I tend to think of them as the same thing. I do not
think anybody would use a plain IVT.
<response to earlier comments>
Is the entire VAS covered by the hierarchical page table system?
With the entire PAS covered by the page table in BRAM it can be walked
in hardware very quickly, one cycle per step as opposed to walking the
page table in DRAM which could be quite slow.
Process switch is handled by including an ASID in the mapping as for a TLB.
For the IVT implementation the table is twice the size needed to cover
the PAS to allow for shared memory pages.
The table just stores mappings VPN -> PPN so I am not quite following
the challenge to using them for shared memory? Multiple VPN-> to the
same PPN are possible. Is it the freeing up of physical pages in SW that
causes an issue?
I seem to recall at least one fellow advocating the limited use of
shared memory, using replication instead of shared libraries for instance.
A hierarchical page table is a better solution, but I was looking for something lower cost. My monster 2-level TLB is currently about 9000
LUTs (I have been working to reduce this). The IVT is about 900 LUTs.
MitchAlsup wrote:
EricP <ThatWouldBeTelling@thevillage.com> posted:
EricP wrote:
Robert Finch wrote:
??? I think a hash table has this characteristic. Multiple VA pages
can point to a single physical page using a hash table. Is it just a
performance issue? ASID is part of the hash/compare.
Be careful about ASID's - they aren't just part of the virtual address
in MMU's like x86. There is a logic aspect in the address compare.
If the PTE Global flag is set then ASID is NOT included in the VA compare,
if the PTE Global flag is clear then ASID is included in the VA compare.
If the G flag is set on OS kernel pages then the effect is to make the
kernel virtual pages one single address space on all cores with no
associated ASID, while process virtual spaces reside in their own ASID's.
The above is easy to implement in a CAM because the Global flag is
part of the PTE entry. Compare the virtual page number (VPN) and if
there is a PTE match, then if the PTE.G flag is set take ASID as matched,
otherwise check if ASID is matched.
But if you just tack the ASID onto the VA to get a larger VA and
use that as an index then you don't get the same logic because
the PTE.G flag that tells you whether ASID matters for that PTE,
is on the PTE that you are looking up.
One could achieve the same effect as the CAM approach by picking off the
range of addresses assigned to the OS and assigning them to ASID = 0,
and having the process range use the current non-zero ASID register.
If we have a 64-bit VA with 4kB pages = 52-bit virtual page number
and we added a 12-bit ASID then that gives a 76-bit virtual space
with a 64-bit AS-VPN to look up. We reserve ASID=0 for the OS,
1..4095 are for processes.
SPARC V8 had 13-bit ASIDs and found that this was way too short for
the market SUN was targeting. SPARC V9 went with 16-bit ASIDs and
I think this is as SHORT as an ASID can be for "reasonable" future proofing.
It also depends on whether ASIDs are assigned globally for a whole system
or locally for each core. Local assignment for each core needs fewer bits.
For globally assigned ASID 16 bits sounds too small for today -
they recycle too often.
When ASIDs were first introduced on x86 I remember a certain amount
of concern for the effects they would have on the fully assoc TLB.
The 32-bit x86 has a 20-bit Virtual Page Number so adding a 12-bit ASID
would increase the TLB size and power usage by 60%.
Back then I spent some time looking at how to do local 4-bit ASIDs
per-core. That is, each core has its own 4-bit ASID counter and each
process remembers which ASID it used last on each core.
Each core's TLB retains entries for up to the last 16 processes to run threads on that core, and each core recycles its ASID independently.
The 4-bit size was chosen based on the assumption that the odds are small
in a TLB containing useful entries beyond the last 16 processes to run.
This eliminates many of the global ASID shoot downs but 4 bits is too small
as it would recycle too quickly causing too frequent local ASID invalidates.
Combining all the ideas, 16+ bit ASID, locally assigned per core,
with a batch TLB ASID invalidate mechanism using a bit mask for matching ASIDs so I can invalidate all ids in just one of the groups I mentioned elsewhere; GGxx_xxxx_xxxx_xxxx where GG is the 2-bit group number.
After getting bogged down testing the hash table, I decided to
shelve it for now. And...
I re-wrote Qupls4 TLB for hierarchical page tables. It is much
cleaner code now. The TLB is taking up only about 600 LUTs and 8
BRAMs now. It is also simpler being only a single level TLB.
Three copies of the TLB would be needed.
Thinking about adding caching of translation steps to avoid main
memory access.
On 2/5/2026 11:23 PM, Kent Dickey wrote:
[snip]
In article <1770238707-5857@newsgrouper.org>,
I think the 32-entry fully associative is the primary TLB. The 1024 entry
is a secondary TLB. Hits in the primary TLB cause no stalls at all.
Misses in the primary probably lead to stalls (but it is possible the
6 cycle latency can be hidden sometimes), so it costs performance.
If the primary TLB (the one with no stalls) is only 4 entries, then
we can easily calculate a pretty substantial stall penalty for most
programs since it will pay the 6 cycle stall a lot (and that's for a L2
TLB hit). So we make the L1 TLB as big as possible, but it cannot be made
too large or it's too slow, so it ends up between 32- and 128-entries
(256 tops).
Full associativity isn't worth it.
The gains from more associativity seem to max out at around
8-way, and 32 pages isn't nearly enough coverage (very high TLB
miss rates would be unavoidable).
Anton Ertl wrote:
Robert Finch <robfi680@gmail.com> writes:
On 2026-02-01 12:49 p.m., EricP wrote:
Simpler to have create process assign a new ASID by incrementing
an unsigned up counter that starts at 1. When you hit its max then reset
to 1 and clear the whole IVT for all non-zero ASID entries in one pass.
A garbage collection approach may work for removed entries.
It might be a complicated state machine though to do in HW.
It could look for invalid or entries with defunct ASIDs.
AFAIK the approach outlined by EricP has been used in commercial
hardware. It's simple, and if there are enough ASIDs, cheap (because
resets are rare). So better add more bits to the ASIDs than
introduce complicated machinery for reclaiming ASIDs.
- anton
The method I described before was a bit brute force, as it looks like
it has to stall the whole system while it erases the old ASIDs.
It is inefficient as it erases all ASIDs including ones recently
assigned.
One issue with invalidating ASIDs is that while the L1-TLB may be fully
associative, L2-TLBs are often set associative.
While a full assoc TLB can invalidate multiple matching entries in
1 clock, the set assoc TLB requires a scan doing a read-compare-write
operation on all the ways in each set, taking 2 or more clocks per set.
So an L2-TLB with 512 sets could take 1024 clocks or more to complete
an ASID shoot down.
This would also cause all running processes to have a flurry of TLB
misses as they reload the entries just purged.
I would want something smoother, continuous, that usually didn't need
to stall the whole system, and only recycles the oldest ASIDs.
Something like: split the ASIDs into 4 groups using the 2 MSBs.
As the ASID counter increments from group N to group N+1, it initiates
an invalidate of group N+2, such that it retains the recent ASIDs in
groups N and N-1. And this is circular, so when it starts assigning in
G3 it initiates the invalidate of G0. The invalidate is also
asynchronous, so the system can continue to run and assign ASIDs while
it invalidates the next group ahead (there are some interlocks waiting
for ACKs).
e.g. Say the ASID counter is 7 bits, so 4 groups of 32 IDs, called
G0..G3. The current assignment ASID counter = 31 in G1.
When it increments to 32 in G2, it begins invalidating all group G3
IDs in the range 96..127. Any processes currently using ASIDs in the
G3 range have their ASID invalidated, and when they next run will be
assigned a new ID in the current assignment group. All processes in
groups G0 and G1 are unaffected by this ASID recycling. The core
broadcasts an IPI (Interprocessor Interrupt) to all other cores to
shoot down all group G3 ASIDs.
On 2/8/26 5:57 PM, BGB wrote:
On 2/5/2026 11:23 PM, Kent Dickey wrote:
[snip]
In article <1770238707-5857@newsgrouper.org>,
I think the 32-entry fully associative is the primary TLB. The
1024-entry is a secondary TLB. Hits in the primary TLB cause no stalls
at all. Misses in the primary probably lead to stalls (but it is
possible the 6-cycle latency can be hidden sometimes), so it costs
performance.
If the primary TLB (the one with no stalls) is only 4 entries, then we
can easily calculate a pretty substantial stall penalty for most
programs, since it will pay the 6-cycle stall a lot (and that's for an
L2 TLB hit). So we make the L1 TLB as big as possible, but it cannot
be made too large or it's too slow, so it ends up between 32 and 128
entries (256 tops).
Full associativity isn't worth it.
The gains from more associativity seem to max out at around 8-way, and
32 pages isn't nearly enough coverage (very high TLB miss rates would
be unavoidable).
Full associativity is not just about conflict misses. Such also
facilitates supporting multiple page sizes in the same TLB. Different
page sizes naturally use different indexing bits, so having zero
indexing bits means that any page can occupy any entry.
André Seznec proposed a way to use (overlaid) skewed associativity to
support multiple page sizes ("Concurrent Support of Multiple Page
Sizes On a Skewed Associative TLB", 2003). This has the limitation of
requiring shared indexing bits so that the different page size probes
(done in parallel) can be mapped to different banks. This also
requires accessing more entries with limited associativity (the
skewed-associative conflict avoidance reduces the miss rate that this
would otherwise cause, as well as balancing utilization among page
sizes).
AMD used full associativity to support coalescing of 4KiB pages into
16KiB (32KiB in Zen+ and earlier) TLB entries. (An ordinary 8-way
associative TLB indexed for 16KiB "pages" would map all uncoalescable
4KiB pages in a 16KiB chunk to the same set, increasing conflict
misses.) It might have been possible to index by the shared bits of
the coalesced and base page sizes and use a CAM-based predictor to
choose whether to index the rest by base page or by coalesced page,
but I suspect such would not be that much smaller than using full
associativity, and likely slower and more power consuming, as well as
vulnerable to false conflicts from the predictor limiting which PTEs
can share a set.
Full associativity may also have been important for Itanium 2's
prevalidated cache, where each cache tag held a one-hot bit vector
indicating which TLB entry applied to it. This allowed a faster tag
check and flash invalidation of (write-through) cache entries when a
TLB entry was evicted. The number of bits in the cache tag limits the
TLB capacity, which makes associativity more important, and TLB misses
are more expensive since they might invalidate useful cache blocks.
(Write-through may have been desirable for reliability, or been a
consequence of seeking a low-latency mechanism for L1 accesses.)
Of course, the tradeoffs are different for an FPGA implementation.
On 2026-02-08 9:28 p.m., Paul Clayton wrote:
André Seznec proposed a way to use (overlaid) skewed associativity to
support multiple page sizes ("Concurrent Support of Multiple Page
Sizes On a Skewed Associative TLB", 2003). This has the limitation of
requiring shared indexing bits so that the different page size probes
(done in parallel) can be mapped to different banks. This also
requires accessing more entries with limited associativity (the
skewed-associative conflict avoidance reduces the miss rate that this
would otherwise cause, as well as balancing utilization among page
sizes).
The article requires signing in with an IEEE membership. I used to have
a student membership many years ago. I let it expire as I was not using
it; this was before the internet was big. Every so often I contemplate
getting one, but I am trying to keep costs down.
Robert Finch wrote:
On 2026-02-08 9:28 p.m., Paul Clayton wrote:
André Seznec proposed a way to use (overlaid) skewed associativity to
support multiple page sizes ("Concurrent Support of Multiple Page
Sizes On a Skewed Associative TLB", 2003). This has the limitation of
requiring shared indexing bits so that the different page size probes
(done in parallel) can be mapped to different banks. This also
requires accessing more entries with limited associativity (the
skewed-associative conflict avoidance reduces the miss rate that this
would otherwise cause, as well as balancing utilization among page
sizes).
The article requires signing in with an IEEE membership. I used to
have a student membership many years ago. I let it expire as I was not
using it; this was before the internet was big. Every so often I
contemplate getting one, but I am trying to keep costs down.
A bit of poking about finds his home page
https://team.inria.fr/pacap/members/andre-seznec/
which links to
https://team.inria.fr/pacap/members/andre-seznec/cache-architecture-research-2/
and to his paper
"Concurrent Support of Multiple Page Sizes on a Skewed Associative
TLB", 2003: http://www.irisa.fr/caps/people/seznec/SKEWEDTLBperso.pdf
On 2/9/2026 2:44 PM, EricP wrote:
Robert Finch wrote:
On 2026-02-08 9:28 p.m., Paul Clayton wrote:
André Seznec proposed a way to use (overlaid) skewed associativity
to support multiple page sizes ("Concurrent Support of Multiple Page
Sizes On a Skewed Associative TLB", 2003). This has the limitation
of requiring shared indexing bits so that the different page size
probes (done in parallel) can be mapped to different banks. This
also requires accessing more entries with limited associativity (the
skewed-associative conflict avoidance reduces the miss rate that
this would otherwise cause, as well as balancing utilization among
page sizes).
The article requires signing in with an IEEE membership. I used to
have a student membership many years ago. I let it expire as I was
not using it; this was before the internet was big. Every so often I
contemplate getting one, but I am trying to keep costs down.
A bit of poking about finds his home page
https://team.inria.fr/pacap/members/andre-seznec/
which links to
https://team.inria.fr/pacap/members/andre-seznec/cache-architecture-research-2/
and to his paper
"Concurrent Support of Multiple Page Sizes on a Skewed Associative
TLB", 2003
http://www.irisa.fr/caps/people/seznec/SKEWEDTLBperso.pdf
My usual thing here was:
In mixed page-size configurations, the MMU is set to the smallest
in-use page size;
When set to larger page sizes, a few of the bits that were low-order
bits with a smaller page size change to being high-order bits with a
larger page size (all other bits remaining in the same place).
So, say:
If using both 4K and 16K pages, the MMU is set for 4K;
But, can set it for 16K if no 4K pages are being used;
Or, 64K if no 4K or 16K pages are being used.
With normal page tables, there is a defined "basic" page size and any
other page sizes would exist as some multiple of this size. There is a
slight inefficiency when dealing with larger pages with a smaller
basic page size, but this was a reasonable tradeoff.
If the page size changes between ASIDs, this isn't a huge issue as
normally ASIDs can't see each other's pages anyway, except for global
pages; but if using page tables, then any global pages shared between
page tables would implicitly need the same page size.
Seemed "obvious enough".
The paper seems to have the idea of subdividing associativity if
multiple page sizes are in use at the same time. I guess this works,
if one has enough associativity.
In this case 4 active page sizes would require at least an 8-way TLB
to avoid "shooting oneself in the foot", though. Here, 2 active page
sizes could be used with a 4-way TLB, where at least 2-way is needed
(assuming modulo addressing) to maintain stability.
This also works, I guess.
Ironically, a similar downside existed in my case when using Addr96
modes, where this also came at the cost of reducing TLB associativity
(though this cost can be reduced when using primarily 48-bit
addressing, *).
*: Say, for example, if pages are either part of:
  The "null quadrant" (low 48-bit range of the 96-bit VAS);
  Part of the "current quadrant" (as defined by GBH);
    Normal 48-bit VAS memory ops are relative to GBH;
    Or, PCH for I$, but nominally PCH==GBH.
      Except maybe if you want to do a Harvard Addressing thing.
In this case, can potentially key the high 48 bits to the ASID, so
things still work so long as GBH is assumed to not change within a
given ASID, and the page-miss handler can upload pages in the cheaper
48-bit VAS format rather than the 96-bit VAS format (and sidestep the
associativity cost).
But, have mostly ended up not using the 96-bit VAS stuff, as it is
kinda overkill.
Even if, yeah, do have a chunk of encoding space being eaten by Load/
Store ops that are mostly unused and exist primarily to deal with
128-bit pointers. In this case would still need a 96-bit TLBE whenever
accessing a page in some other part of the VAS.
...
On 2026-02-08 9:28 p.m., Paul Clayton wrote:
[snip]
André Seznec proposed a way to use (overlaid) skewed
associativity to support multiple page sizes ("Concurrent
Support of Multiple Page Sizes On a Skewed Associative TLB",
2003).
The article requires signing in with an IEEE membership. I used
to have a student membership many years ago. I let it expire as
I was not using it; this was before the internet was big. Every
so often I contemplate getting one, but I am trying to keep
costs down.
On 2026-02-09 5:27 p.m., BGB wrote:
On 2/9/2026 2:44 PM, EricP wrote:
I found a slide presentation on the idea.
I am wondering how an LRU system would work with the skewed entries.
I assume the entries would be shifting between the ways for LRU. So,
if there is a different-size page in one of the ways, would it still
work?
Robert Finch <robfi680@gmail.com> posted:
On 2026-02-09 5:27 p.m., BGB wrote:
On 2/9/2026 2:44 PM, EricP wrote:
I found a slide presentation on the idea.
I am wondering how a LRU system would work with the skewed entries.
I assume the entries would be shifting between the ways for LRU. So, if
there is a different size page in one of the ways would it still work?
One would use the not-recently-used variant of LRU.
One accumulated used bits on a per set basis.
When all used-bits have been set, one clears
all used bits.
In practice it is easier to implement and almost as good.
On 2/10/2026 1:04 PM, MitchAlsup wrote:
Robert Finch <robfi680@gmail.com> posted:
On 2026-02-09 5:27 p.m., BGB wrote:
On 2/9/2026 2:44 PM, EricP wrote:
I found a slide presentation on the idea.
I am wondering how a LRU system would work with the skewed entries.
I assume the entries would be shifting between the ways for LRU. So, if
there is a different size page in one of the ways would it still work?
One would use the not-recently-used variant of LRU.
One accumulated used bits on a per set basis.
When all used-bits have been set, one clears
all used bits.
In practice it is easier to implement and almost as good.
FWIW:
I ended up not using an MRU or LRU for the TLB.
While there is a loss of efficiency here (it always evicts the oldest
page rather than the least-used), it does mean that the TLB miss
handler can model the TLB deterministically (if the handler knows the
length and width, it can be modeled exactly).
Potentially a clever TLB miss handler could compensate by re-injecting
the page that fell off the end if it is very likely to trigger another
TLB miss (possibly tracking commonly-missed pages in an STF MRU or
similar in software, and re-injecting pages that appear within this
MRU at a higher rank than whatever is in the current last place).
Another theoretical possibility would be to allow the CPU to access
the contents of the TLB via MMIO, but there are strong reasons not to
do this.
So, naive FIFO ended up being preferable. The other possibility being
to have LRU, but then needing to fully evict a given set whenever a
page's attributes are modified (such as a page swapped in or out of
the pagefile).
It is tradeoffs here...
BGB <cr88192@gmail.com> posted:
On 2/10/2026 1:04 PM, MitchAlsup wrote:
Robert Finch <robfi680@gmail.com> posted:
On 2026-02-09 5:27 p.m., BGB wrote:
On 2/9/2026 2:44 PM, EricP wrote:
I found a slide presentation on the idea.
I am wondering how a LRU system would work with the skewed entries.
I assume the entries would be shifting between the ways for LRU. So, if >>>> there is a different size page in one of the ways would it still work?
One would use the not-recently-used variant of LRU.
One accumulated used bits on a per set basis.
When all used-bits have been set, one clears
all used bits.
In practice it is easier to implement and almost as good.
FWIW:
I ended up not using an MRU or LRU for the TLB.
My NRU is comprised of a single S-R flip-flop per PTE, with all of
them wired to a (high-fan-in) AND gate.
When all NRU bits are set, the AND gate asserts, and all reset
synchronously.
When the TLB takes a miss, Find-First logic chooses a unary index for
the new PTE. Writing (or reading) a PTE sets its S-R FF.
Simple and effective.
On 2/10/2026 1:56 PM, MitchAlsup wrote:
BGB <cr88192@gmail.com> posted:
On 2/10/2026 1:04 PM, MitchAlsup wrote:
Robert Finch <robfi680@gmail.com> posted:
On 2026-02-09 5:27 p.m., BGB wrote:
On 2/9/2026 2:44 PM, EricP wrote:
One would use the not-recently-used variant of LRU.
I found a slide presentation on the idea.
I am wondering how a LRU system would work with the skewed entries.
I assume the entries would be shifting between the ways for LRU. So, if >>>> there is a different size page in one of the ways would it still work? >>>
One accumulated used bits on a per set basis.
When all used-bits have been set, one clears
all used bits.
In practice it is easier to implement and almost as good.
FWIW:
I ended up not using an MRU or LRU for the TLB.
My NRU is comprised of a single S-R flip-flop per PTE, with all of them wired to an (high-fan-in) AND gate.
When all NRU-bits are set, AND gate asserts, and all reset synchronously.
When TLB takes a miss, Fint-First logic chooses a unary index for the new PTE. Writing (or reading) a PTE sets its S-R FF.
Simple and effective.
I presume NRU means "not recently used". But if, when all the NRU bits
are set, all are reset, including the MRU entry's bit, might it get
overwritten?
BGB <cr88192@gmail.com> posted:
On 2/10/2026 1:04 PM, MitchAlsup wrote:
Robert Finch <robfi680@gmail.com> posted:
On 2026-02-09 5:27 p.m., BGB wrote:
On 2/9/2026 2:44 PM, EricP wrote:
I found a slide presentation on the idea.
I am wondering how a LRU system would work with the skewed entries.
I assume the entries would be shifting between the ways for LRU. So, if >>>> there is a different size page in one of the ways would it still work?
One would use the not-recently-used variant of LRU.
One accumulated used bits on a per set basis.
When all used-bits have been set, one clears
all used bits.
In practice it is easier to implement and almost as good.
FWIW:
I ended up not using an MRU or LRU for the TLB.
My NRU is comprised of a single S-R flip-flop per PTE, with all of them
wired to an (high-fan-in) AND gate.
When all NRU-bits are set, AND gate asserts, and all reset synchronously.
When TLB takes a miss, Fint-First logic chooses a unary index for the new PTE. Writing (or reading) a PTE sets its S-R FF.
Simple and effective.
While there is a loss of efficiency here (it always evicts the oldest
page rather than the least-used), it does mean that the TLB miss handler
can model the TLB deterministically (if the handler knows the length and
width; it can be modeled exactly).
Potentially a clever TLB miss handler could compensate by re-injecting
the page that fell off the end if it is very likely to trigger another
TLB miss (possibly tracking commonly-missed pages in an STF MRU or
similar in software, and re-injecting pages that appear within this MRU
at a higher rank than whatever is in the current last place).
Another theoretical possibility would be to allow the CPU to access the
contents of the TLB via MMIO, but there are strong reasons to not do
this.
MMI/O is potentially SLOW.
So, naive FIFO ended up being preferable. The other possibility
being to have LRU, but then needing to fully evict a given set whenever
a pages' attributes are modified (such as a page swapped in or out of
the pagefile).
True LRU is BigO( n^3 ) in logic gates; whereas NRU is BigO( k ).
It is tradeoffs here...
On 2026-02-10 4:56 p.m., MitchAlsup wrote:
BGB <cr88192@gmail.com> posted:
On 2/10/2026 1:04 PM, MitchAlsup wrote:
Robert Finch <robfi680@gmail.com> posted:
On 2026-02-09 5:27 p.m., BGB wrote:
On 2/9/2026 2:44 PM, EricP wrote:
One would use the not-recently-used variant of LRU.
I found a slide presentation on the idea.
I am wondering how a LRU system would work with the skewed entries.
I assume the entries would be shifting between the ways for LRU. So, if
there is a different size page in one of the ways would it still work?
One accumulated used bits on a per set basis.
When all used-bits have been set, one clears
all used bits.
In practice it is easier to implement and almost as good.
FWIW:
I ended up not using an MRU or LRU for the TLB.
My NRU is comprised of a single S-R flip-flop per PTE, with all of them
wired to an (high-fan-in) AND gate.
When all NRU-bits are set, AND gate asserts, and all reset synchronously.
When TLB takes a miss, Fint-First logic chooses a unary index for the new
PTE. Writing (or reading) a PTE sets its S-R FF.
Simple and effective.
While there is a loss of efficiency here (it always evicts the oldest
page rather than the least-used), it does mean that the TLB miss handler
can model the TLB deterministically (if the handler knows the length and
width; it can be modeled exactly).
Potentially a clever TLB miss handler could compensate by re-injecting
the page that fell off the end if it is very likely to trigger another
TLB miss (possibly tracking commonly-missed pages in an STF MRU or
similar in software, and re-injecting pages that appear within this MRU
at a higher rank than whatever is in the current last place).
Another theoretical possibility would be to allow the CPU to access the
contents of the TLB via MMIO, but there are strong reasons to not do
this.
MMI/O is potentially SLOW.
So, naive FIFO ended up being preferable. The other possibility
being to have LRU, but then needing to fully evict a given set whenever
a pages' attributes are modified (such as a page swapped in or out of
the pagefile).
True LRU is BigO( n^3 ) in logic gates; whereas NRU is BigO( k ).
I did not realize it was so many gates for LRU? I thought it was mainly
latches. The LRU in use for Qupls has the ways connected as a shift.
Then new entries are always added at way three for a four-way TLB; way
zero gets evicted.
If it is lower gate count for NRU, the TLB may be switched - another
re-write.
The Qupls TLB is accessible as MMIO but this is probably unnecessary.
It is also updated by an SCP now instead of a state machine. The SCP
also handles Amiga Copper style audio/video co-processing.
It is tradeoffs here...
On 2/10/2026 3:56 PM, MitchAlsup wrote:
BGB <cr88192@gmail.com> posted:
On 2/10/2026 1:04 PM, MitchAlsup wrote:
Robert Finch <robfi680@gmail.com> posted:
On 2026-02-09 5:27 p.m., BGB wrote:
On 2/9/2026 2:44 PM, EricP wrote:
One would use the not-recently-used variant of LRU.
I found a slide presentation on the idea.
I am wondering how a LRU system would work with the skewed entries.
I assume the entries would be shifting between the ways for LRU. So, if
there is a different size page in one of the ways would it still work?
One accumulated used bits on a per set basis.
When all used-bits have been set, one clears
all used bits.
In practice it is easier to implement and almost as good.
FWIW:
I ended up not using an MRU or LRU for the TLB.
My NRU is comprised of a single S-R flip-flop per PTE, with all of them wired to an (high-fan-in) AND gate.
When all NRU-bits are set, AND gate asserts, and all reset synchronously.
When TLB takes a miss, Fint-First logic chooses a unary index for the new PTE. Writing (or reading) a PTE sets its S-R FF.
Simple and effective.
All pose a similar issue though, in that they introduce behavior that
is not visible to the miss handler, and thus the exact state of the
TLB can't be modeled in software. Well, unless maybe there were some
good way to estimate this in SW, and with an LDTLB which encoded which
way to load the TLBE into:
LDTLB: Load register pair into TLB, push-back by 1.
LDTLB{1/2/3/4}: Load into 1st/2nd/3rd/4th way of TLB.
Then again, in this case, NRU state would be 1024 bits, and could
potentially be made accessible via a special mechanism (say, an LDNRU
instruction). Would still be too bulky in this case to route it
through CPUID, and wouldn't want to get the MMIO mechanism involved.
Well, maybe could special-case it:
MMIO requests to the TLB need not make it all the way to the MMIO bus;
the TLB could just see the request passing by and be like "hey, that
one's for me" and handle it directly as it passes by (a similar
mechanism having already been used for LDTLB, which sort of reaches
the TLB like a specially malformed memory store).
MMI/O is potentially SLOW.
Beyond this... if we are already paying the cost of an exception
handler for TLB Miss, then some extra MMIO isn't likely a huge issue.
True LRU is BigO( n^3 ) in logic gates; whereas NRU is BigO( k ).
oK.
I guess I can note that my use of LRU/MRU was most often based on STF,
in a naive case:
Access 3: Swap 3 and 2
Access 2: Swap 2 and 1
Access 1: Swap 1 and 0
Access 0: No Change
Say, for example, in my current 3D engine for things like height-maps:
Fill a lookup table with random numbers generated from a PRNG (256
entries);
Hash the X/Y or X/Y/Z coords ("(((X*P1+Y*P2+Z*P3)*P4)>>SH)&255");
Select an index from the lookup table, and repeat with adjusted indices
for adjacent entries;
Interpolate between these.
BGB <cr88192@gmail.com> posted:
On 2/10/2026 3:56 PM, MitchAlsup wrote:
BGB <cr88192@gmail.com> posted:
On 2/10/2026 1:04 PM, MitchAlsup wrote:
Robert Finch <robfi680@gmail.com> posted:
On 2026-02-09 5:27 p.m., BGB wrote:
On 2/9/2026 2:44 PM, EricP wrote:
One would use the not-recently-used variant of LRU.
I found a slide presentation on the idea.
I am wondering how a LRU system would work with the skewed entries.
I assume the entries would be shifting between the ways for LRU. So, if
there is a different size page in one of the ways would it still work?
One accumulated used bits on a per set basis.
When all used-bits have been set, one clears
all used bits.
In practice it is easier to implement and almost as good.
FWIW:
I ended up not using an MRU or LRU for the TLB.
My NRU is comprised of a single S-R flip-flop per PTE, with all of them
wired to an (high-fan-in) AND gate.
When all NRU-bits are set, AND gate asserts, and all reset synchronously.
When TLB takes a miss, Find-First logic chooses a unary index for the
new PTE. Writing (or reading) a PTE sets its S-R FF.
Simple and effective.
All pose a similar issue though in that they introduce behavior that is
not visible to the miss handler, and thus the exact state of the TLB
can't be modeled in software. Well, unless maybe there were some good
way to estimate this in SW and with a LDTLB which encoded which way to
load the TLBE into.
LDTLB: Load register pair into TLB, push-back by 1.
LDTLB{1/2/3/4}; Load into 1st/2nd/3rd/4th way of TLB.
MIPS (the company, not Stanford) had the TLB logic find the entry to
be written (find-first), so SW just hashed a big table, checked for a
match, and if so, inserted the new PTE where HW said.
IF you are going to do a SW TLB reloader, this is the way.
Then again, in this case, NRU state would be 1024 bits, and could
potentially be made accessible via a special mechanism (say, an LDNRU
instruction). Would still be too bulky in this case to route it through
CPUID, and wouldn't want to get the MMIO mechanism involved.
Well, maybe could special case-it:
MMIO requests to the TLB need not make it all the way to the MMIO bus,
the TLB could just see the request passing by and be like "hey, that
one's for me" and handle it directly as it passes by (a similar
mechanism having already used for LDTLB, which sort of reach the TLB
like specially malformed memory stores).
In wide super-scalar machines all of the memory 'paths' need this
logic, and sometimes ordering of MMI/O becomes a "little tricky".
---------------------
MMI/O is potentially SLOW.
Beyond this... If we are already paying the cost of a exception handler
for TLB Miss, then some extra MMIO isn't likely a huge issue.
The MIPS TLB reloader was 17 cycles (when instructions were in cache).
I have witnessed MMI/O that was slower than 1 µs !! Also the control
register bus in Athlon/Opteron was SLOW, too:: so you can't always
use control registers in lieu of MMI/O ...
-----------------
True LRU is BigO( n^3 ) in logic gates; whereas NRU is BigO( k ).
oK.
I guess I can note that my use of LRU/MRU was most often based on STF,
in a naive case:
Access 3: Swap 3 and 2
Access 2: Swap 2 and 1
Access 1: Swap 1 and 0
Access 0: No Change
SWAP moves the whole size of PTE !! whereas NRU sets 1 bit.
Would never be allowed in a low-power implementation.
----------------
Say, for example, in my current 3D engine for things like height-map:
Fill a lookup table with random numbers generated from a PRNG (256 entries);
Hash the X/Y or X/Y/Z coords ("(((X*P1+Y*P2+Z*P3)*P4)>>SH)&255");
Select an index from the lookup table, and repeat with adjusted indices
for adjacent entries;
Interpolate between these.
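In C, the recipe above might look like this 1D sketch (the constants P1,
P4, SH and the tiny LCG PRNG are arbitrary stand-ins, and the Y/Z terms
of the quoted hash are dropped for brevity):

```c
#include <stdint.h>

#define P1 0x9E3779B1u   /* arbitrary odd multipliers */
#define P4 0x85EBCA6Bu
#define SH 16

static uint8_t lut[256];

void noise_init(uint32_t seed) {          /* fill with a small LCG PRNG */
    for (int i = 0; i < 256; i++) {
        seed = seed * 1664525u + 1013904223u;
        lut[i] = (uint8_t)(seed >> 24);
    }
}

static uint32_t hash(uint32_t x) {        /* 1D form of "(((X*P1)*P4)>>SH)&255" */
    return (((x * P1) * P4) >> SH) & 255u;
}

/* Sample at fixed-point position pos (8 fractional bits): look up the
 * two adjacent lattice entries and linearly interpolate between them. */
int noise(uint32_t pos) {
    uint32_t xi = pos >> 8, f = pos & 255u;
    int a = lut[hash(xi)];
    int b = lut[hash(xi + 1)];
    return a + (((b - a) * (int)f) >> 8);
}
```

At integer positions the sample is the table value itself; fractional
positions land between the two neighboring lattice values.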
One of the cool things about HW hashing is that bit reversal of half of
the fields in the hash goes a long way to 'whiten' the noise signature.
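A shift-and-mask 8-bit reversal that could be applied to a coordinate
field before it enters the hash (this is the standard technique, not
code from the post):

```c
#include <stdint.h>

/* Reverse the bits of an 8-bit value: swap adjacent bits, then bit
 * pairs, then nibbles. Reversing a coordinate decorrelates the
 * low-order bits of adjacent inputs before hashing. */
uint8_t rev8(uint8_t x) {
    x = (uint8_t)(((x & 0x55u) << 1) | ((x >> 1) & 0x55u)); /* single bits */
    x = (uint8_t)(((x & 0x33u) << 2) | ((x >> 2) & 0x33u)); /* bit pairs */
    x = (uint8_t)((x << 4) | (x >> 4));                     /* nibbles */
    return x;
}
```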
kegs@provalid.com (Kent Dickey) posted:
In article <1770238707-5857@newsgrouper.org>,
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
kegs@provalid.com (Kent Dickey) posted:
An advantage of hashed page tables is the TLB itself has no "intermediate"
entries. This lets the primary TLB hold more useful translations. This
is not much of an issue if the primary TLB is >= 256 entries, but it's
really noticeable if the TLB is <= 32 entries. You need to lean in to
hashed page tables' advantages.
Question: If one has a 32-E to 64-E fully associative L1 TLB backed up by
a 6-cycle 1024-E 4-way L2 TLB--does the L1+L2 constitute its "primary"
TLB (or not, with explanation)?!?
I think the 32-entry fully associative is the primary TLB. The 1024 entry is a secondary TLB. Hits in the primary TLB cause no stalls at all.
Misses in the primary probably lead to stalls (but it is possible the
6 cycle latency can be hidden sometimes), so it costs performance.
If the primary TLB (the one with no stalls) is only 4 entries, then
we can easily calculate a pretty substantial stall penalty for most
programs, since it will pay the 6-cycle stall a lot (and that's for an
L2 TLB hit). So we make the L1 TLB as big as possible, but it cannot be
made too large or it's too slow, so it ends up between 32 and 128
entries (256 tops).
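The stall penalty above is just a product of three factors; a hedged
back-of-envelope helper (the numbers in the example are purely
illustrative, not from the post):

```c
/* Extra CPI from L1 TLB misses that hit in the L2 TLB:
 *   extra CPI = (memory refs per instruction)
 *             * (L1 TLB miss rate)
 *             * (L2 TLB hit latency in cycles)
 * e.g. 0.3 refs/instr * 5% miss rate * 6 cycles = 0.09 extra CPI. */
double tlb_stall_cpi(double refs_per_instr, double miss_rate,
                     double l2_cycles) {
    return refs_per_instr * miss_rate * l2_cycles;
}
```

Even a modest miss rate on a tiny (say, 4-entry) L1 TLB quickly adds a
noticeable fraction of a cycle per instruction, which is Kent's point.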
I'm just pointing out the intermediate hierarchical entries (the page
table entries which are just page-table pointers) need to be cached in
the TLB to make future lookups faster.
Cached yes, in TLB no--generally cached in TableWalk Accelerator store.
And putting them there reduces
the size of the TLB for "useful" entries.
Which is why you don't put 'em there.
You can try to avoid this penalty by only putting page table pointer entries in the L2 TLB, but then you're just making primary TLB misses slower, especially since the L2 TLB lookup cannot scan all entries like the primary can--you have to probe
each one separately.
If you want to translate the address 0x1234_56789000 using 4KB pages, you can do a SINGLE primary associative TLB lookup, and look for:
1) Page entry for 0x1234_56789000
2) Page-table pointer for 0x1234_566xxxxx
3) Page-table pointer for 0x1234_4xxxxxxx
4) Page-table pointer for 0x120x_xxxxxxxx
(you may need to do a second lookup to get page-table entries, since it's expensive to deal with multiple hits in one clock).
Another reason to put them somewhere other than the TLB. One can make all 4 accesses simultaneously, and start the tablewalk as far down the tree as possible.
With the page-table-pointer for 0x1234_566xxxxx, you then do a single
data cache lookup to get the page entry (say 3 more clocks). So putting these page-table pointers in the L1 TLB has a big benefit to the page
walk times.
A delicate balance, especially if L2 is less than 10 cycles.
------------------------
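For concreteness, the four probe tags in Kent's example can be computed
with plain shifts, assuming x86-64-style 4-level paging (4 KB base
pages, 9 bits per level); the shift amounts are the standard ones, the
function itself is just an illustration:

```c
#include <stdint.h>

/* One leaf tag plus three page-table-pointer tags at 2 MB / 1 GB /
 * 512 GB granularity, all comparable in a single associative lookup. */
void probe_tags(uint64_t va, uint64_t tag[4]) {
    tag[0] = va >> 12;   /* leaf page:             0x1234_56789xxx */
    tag[1] = va >> 21;   /* 2 MB region pointer:   0x1234_566x_xxxx */
    tag[2] = va >> 30;   /* 1 GB region pointer:   0x1234_4xxx_xxxx */
    tag[3] = va >> 39;   /* 512 GB region pointer: 0x120x_xxxx_xxxx */
}
```

A hit on tag[1] means the walk can start at the last level; a hit only
on tag[3] still skips the top of the tree.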
Accounting for the overhead, it seems like almost any process will waste
at least 7 L1 TLB entries to these intermediate page table pointer
entries, so your 32-entry L1 TLB will generally hold no more than 25
useful pages.
So, you don't put 'em in the TLB; you put 'em in the TW-Accelerator Store.
Kent
The article requires signing in with an IEEE membership. I used to have
a student membership many years ago. I let it expire as I was not using
it; this was before the internet was big. Every so often I contemplate
getting one, but I am trying to keep costs down.