The solution to the excessive a[] traffic would be having the ability
to index the register file Ra[#] so the array can be allocated into
registers and indexed from the file itself. Most ISAs do not have this ability--although a few GPU ISAs do.
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
The solution to the excessive a[] traffic would be having the ability
to index the register file Ra[#] so the array can be allocated into
registers and indexed from the file itself. Most ISAs do not have this
ability--although a few GPU ISAs do.
What are its drawbacks?
Do register accesses get slower?
A possible alternative that I have seen is to "memory map" the
registers as an alternative accessing mechanism. This allows you to
"index" the registers, similarly to indexing a memory array.
On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:
A possible alternative that I have seen is to "memory map" the
registers as an alternative accessing mechanism. This allows you to
"index" the registers, similarly to indexing a memory array.
Doesn't this defeat the point of how registers are supposed to work?
On 5/11/2026 7:17 PM, Lawrence D'Oliveiro wrote:
On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:
A possible alternative that I have seen is to "memory map" the
registers as an alternative accessing mechanism. This allows you to
"index" the registers, similarly to indexing a memory array.
Doesn't this defeat the point of how registers are supposed to work?
No. In the vast majority of cases, you reference registers as you do
now, with register numbers in assigned places in the instruction. But
you do have an "alternate" way of referencing them that allows you to
use an index, just as you can with memory. That mechanism would only be
used in rare circumstances.
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 5/11/2026 7:17 PM, Lawrence D'Oliveiro wrote:
On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:
A possible alternative that I have seen is to "memory map" the
registers as an alternative accessing mechanism. This allows you to
"index" the registers, similarly to indexing a memory array.
Doesn't this defeat the point of how registers are supposed to work?
No. In the vast majority of cases, you reference registers as you do
now, with register numbers in assigned places in the instruction. But
you do have an "alternate" way of referencing them that allows you to
use an index, just as you can with memory. That mechanism would only be
used in rare circumstances.
It would have to be implemented. How? And how does the supposed
rareness help?
Remember the subject: You suggested this mechanism as a way to
eliminate the disadvantage of VVM compared to AVX-512 in 8x8 matrix
multiplication, and the disadvantage was that VVM cannot eliminate
some memory accesses that AVX-512 can. Turning the registers into
memory does not solve that, and probably incurs additional costs.
This cure is worse than the disease.
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:
On 5/11/2026 7:17 PM, Lawrence D'Oliveiro wrote:
On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:
A possible alternative that I have seen is to "memory map" the
registers as an alternative accessing mechanism. This allows you to
"index" the registers, similarly to indexing a memory array.
Doesn't this defeat the point of how registers are supposed to work?
No. In the vast majority of cases, you reference registers as you do
now, with register numbers in assigned places in the instruction. But
you do have an "alternate" way of referencing them that allows you to
use an index, just as you can with memory. That mechanism would only be
used in rare circumstances.
The PDP-10 comes to mind... with its "registers" which were just
aliases for the first memory locations, which were usually
implemented to be faster than main memory. You could even run
code there. But that fell out of fashion for a reason.
I believe that this would be particularly difficult with OoO.
On 5/11/2026 10:14 PM, Anton Ertl wrote:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 5/11/2026 7:17 PM, Lawrence D'Oliveiro wrote:
On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:
A possible alternative that I have seen is to "memory map" the
registers as an alternative accessing mechanism. This allows you to
"index" the registers, similarly to indexing a memory array.
Doesn't this defeat the point of how registers are supposed to work?
No. In the vast majority of cases, you reference registers as you do
now, with register numbers in assigned places in the instruction. But
you do have an "alternate" way of referencing them that allows you to
use an index, just as you can with memory. That mechanism would only be
used in rare circumstances.
It would have to be implemented. How?
Let me give one possible implementation. There are certainly others.
Say you have 32 registers. They are "memory mapped" into the first 32
addresses of memory. So programs would have to start not at zero, but
at 32 (I know this can cause other problems - I clearly have not thought
through all of the details.) So now when the CPU encounters a load (or
store) instruction where the virtual address is less than 32, it is
resolved not by the memory system, but by the appropriate register.
i.e. if the virtual address was say 4, the load would be from register
R4, not memory location 4.
Yes, the virtual addressing mechanism
would have to be sensitive to whether the address was below 32 or not,
but that is simple within the CPU. Note that the load instruction in
this case would not touch the memory system at all, so no cache lookups,
no TLB lookups, etc.
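
A minimal sketch of the routing described above, written as a toy Python model rather than anything resembling real hardware. It assumes word addressing (address 4 names R4) and a 32-entry register file; the class name `ToyCore` and the `memory_accesses` counter are hypothetical names introduced only for illustration.

```python
NUM_REGS = 32  # register count taken from the post


class ToyCore:
    """Toy model: addresses below NUM_REGS name registers, the rest name memory."""

    def __init__(self):
        self.regs = [0] * NUM_REGS
        self.memory = {}          # sparse, word-addressed stand-in for memory
        self.memory_accesses = 0  # counts trips to the simulated memory system

    def load(self, addr):
        if 0 <= addr < NUM_REGS:
            return self.regs[addr]      # resolved by the register file
        self.memory_accesses += 1       # cache/TLB work would happen here
        return self.memory.get(addr, 0)

    def store(self, addr, value):
        if 0 <= addr < NUM_REGS:
            self.regs[addr] = value
        else:
            self.memory_accesses += 1
            self.memory[addr] = value


core = ToyCore()
core.regs[4] = 99
index = 4                               # a run-time index, as with a memory array
value_from_register = core.load(index)  # reads R4, not memory location 4
core.store(100, 7)
value_from_memory = core.load(100)
```

In this model only the two accesses to address 100 count as memory traffic. The sketch says nothing about the scheduling and renaming costs raised elsewhere in the thread.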
And how does the supposed
rareness help?
Lawrence said it would defeat the purpose of registers. My comment was
that since it would be rare, i.e. most of the register references would
be the same as before, it wouldn't defeat the purpose.
Remember the subject: You suggested this mechanism as a way to
eliminate the disadvantage of VVM compared to AVX-512 in 8x8 matrix
multiplication, and the disadvantage was that VVM cannot eliminate
some memory accesses that AVX-512 can. Turning the registers into
memory does not solve that, and probably incurs additional costs.
This cure is worse than the disease.
The register accesses are not turned into memory accesses. If the
address is less than 32, the instruction references the actual register,
not the memory. The only advantage of this scheme is that it allows
"indexing" the registers similarly to how one indexes memory today.
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 5/11/2026 10:14 PM, Anton Ertl wrote:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 5/11/2026 7:17 PM, Lawrence D'Oliveiro wrote:
On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:
A possible alternative that I have seen is to "memory map" the
registers as an alternative accessing mechanism. This allows you to
"index" the registers, similarly to indexing a memory array.
Doesn't this defeat the point of how registers are supposed to work?
No. In the vast majority of cases, you reference registers as you do
now, with register numbers in assigned places in the instruction. But
you do have an "alternate" way of referencing them that allows you to
use an index, just as you can with memory. That mechanism would only be
used in rare circumstances.
It would have to be implemented. How?
Let me give one possible implementation. There are certainly others.
Say you have 32 registers. They are "memory mapped" into the first 32
addresses of memory. So programs would have to start not at zero, but
at 32 (I know this can cause other problems - I clearly have not thought
through all of the details.) So now when the CPU encounters a load (or
store) instruction where the virtual address is less than 32, it is
resolved not by the memory system, but by the appropriate register.
i.e. if the virtual address was say 4, the load would be from register
R4, not memory location 4.
How is that done?
You have a load instruction in the scheduler. How
does it know when the "memory" location it is waiting for is ready?
The register renamer knows nothing about the address that the
instruction accesses, because the address will usually come in at
run-time (otherwise one would use a direct register access). So
should every load and store wait for all physical registers to be
ready that corresponded to logical registers when the instruction came
through the register renamer? That would slow down all loads and
stores. If not, how else should it be done?
Once we know all the registers are ready, the instruction is admitted
to the load/store unit. Now you have to add an extra data path that
accesses the physical register. That's a completely new data path
with an additional register port that will cost area, and possibly
gate delays.
When the load actually has its value, it will continue to the ROB, and
only when it is retired will all the 32 or 64 registers that it has
reserved lower their reference count (and of course, you have to
add hardware for performing this many updates, whereas normal
instructions only update a few register reference counts).
And for stores, similar problems show up that also have to be dealt with.
Hardware has been optimized for fast access to registers encoded
directly in instructions, and for (in the usual case) fast access to
memory where the address is determined at run-time.
There are reasons why architectural features that allowed accessing
registers with memory instructions have vanished, even before OoO
execution became popular, and the developments since then, in
particular OoO execution, have added more reasons.
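
The dependency argument above can be illustrated with a small Python model. This is a sketch under stated assumptions, not a description of any real renamer: the rename map, the physical register numbering, and the function names `direct_read_deps` and `mapped_load_deps` are all hypothetical.

```python
NUM_LOGICAL_REGS = 32


def direct_read_deps(rename_map, src_reg):
    """Register number encoded in the instruction: one physical register to wait for."""
    return {rename_map[src_reg]}


def mapped_load_deps(rename_map):
    """Load whose run-time address might name any register.

    The renamer does not know the address yet, so the conservative choice
    is to wait for every physical register that is currently mapped.
    """
    return set(rename_map.values())


# Hypothetical rename state: logical register i lives in physical register 100 + i.
rename_map = {i: 100 + i for i in range(NUM_LOGICAL_REGS)}

direct = direct_read_deps(rename_map, 4)
conservative = mapped_load_deps(rename_map)
```

Here `direct` holds a single physical register while `conservative` holds all 32, which is the per-load bookkeeping cost the post objects to.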
Yes, the virtual addressing mechanism
would have to be sensitive to whether the address was below 32 or not,
but that is simple within the CPU. Note that the load instruction in
this case would not touch the memory system at all, so no cache lookups,
no TLB lookups, etc.
The alternative is unlikely to be any faster.
Remember the subject: You suggested this mechanism as a way to
eliminate the disadvantage of VVM compared to AVX-512 in 8x8 matrix
multiplication, and the disadvantage was that VVM cannot eliminate
some memory accesses that AVX-512 can. Turning the registers into
memory does not solve that, and probably incurs additional costs.
This cure is worse than the disease.
The register accesses are not turned into memory accesses.
Then why map the registers into memory and let load/store instructions
(or uops) access them?
Having separate read-indexed-reg and write-indexed-reg instructions
would avoid burdening the load/store unit with this idea, but would
still suffer from many of the problems mentioned above. There are
reasons why no modern architecture has such instructions.
- anton
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 5/11/2026 10:14 PM, Anton Ertl wrote:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 5/11/2026 7:17 PM, Lawrence D'Oliveiro wrote:
On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:
A possible alternative that I have seen is to "memory map" the
registers as an alternative accessing mechanism. This allows you to
"index" the registers, similarly to indexing a memory array.
Doesn't this defeat the point of how registers are supposed to work?
No. In the vast majority of cases, you reference registers as you do
now, with register numbers in assigned places in the instruction. But
you do have an "alternate" way of referencing them that allows you to
use an index, just as you can with memory. That mechanism would only be
used in rare circumstances.
It would have to be implemented. How?
Let me give one possible implementation. There are certainly others.
Say you have 32 registers. They are "memory mapped" into the first 32
addresses of memory. So programs would have to start not at zero, but
at 32 (I know this can cause other problems - I clearly have not thought
through all of the details.) So now when the CPU encounters a load (or
store) instruction where the virtual address is less than 32, it is
resolved not by the memory system, but by the appropriate register.
i.e. if the virtual address was say 4, the load would be from register
R4, not memory location 4. Yes, the virtual addressing mechanism
would have to be sensitive to whether the address was below 32 or not,
but that is simple within the CPU. Note that the load instruction in
this case would not touch the memory system at all, so no cache lookups,
no TLB lookups, etc.
It's really pretty straightforward - consider a SATA controller,
for instance - it has a bunch of registers, generally implemented
with flops that are accessible by reading and writing specific addresses
in the system address map.
There's no reason that a certain range of physical (or even virtual)
addresses cannot refer to a set of flopped registers rather than the
underlying memory architecture; such addresses being known to the
processor as special and intercepted before reaching the
processor interconnect fabric. Either a fixed address, or derived
from a programmable base register.
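
The interception described above can be sketched as a small address decoder in Python. This is only an illustration: `AddressDecoder`, `window_base` (standing in for the programmable base register), and `fabric` (standing in for everything behind the interconnect) are hypothetical names, and the window is assumed to be word addressed.

```python
class AddressDecoder:
    """Routes addresses inside a register window to flops, the rest to the fabric."""

    def __init__(self, num_regs=32, window_base=0):
        self.num_regs = num_regs
        self.window_base = window_base  # stand-in for a programmable base register
        self.regs = [0] * num_regs
        self.fabric = {}                # stand-in for the memory system

    def in_window(self, addr):
        return self.window_base <= addr < self.window_base + self.num_regs

    def read(self, addr):
        if self.in_window(addr):
            return self.regs[addr - self.window_base]  # intercepted locally
        return self.fabric.get(addr, 0)

    def write(self, addr, value):
        if self.in_window(addr):
            self.regs[addr - self.window_base] = value
        else:
            self.fabric[addr] = value


decoder = AddressDecoder(window_base=0x1000)
decoder.write(0x1004, 5)   # lands in register 4, never reaches the fabric
decoder.write(0x2000, 6)   # ordinary memory write
register_value = decoder.read(0x1004)
```

With `window_base=0` this reduces to the fixed-address variant discussed earlier in the thread.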
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:
On 5/11/2026 10:14 PM, Anton Ertl wrote:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 5/11/2026 7:17 PM, Lawrence D'Oliveiro wrote:
On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:
A possible alternative that I have seen is to "memory map" the
registers as an alternative accessing mechanism. This allows you to
"index" the registers, similarly to indexing a memory array.
Doesn't this defeat the point of how registers are supposed to work?
No. In the vast majority of cases, you reference registers as you do
now, with register numbers in assigned places in the instruction. But
you do have an "alternate" way of referencing them that allows you to
use an index, just as you can with memory. That mechanism would only be
used in rare circumstances.
It would have to be implemented. How?
Let me give one possible implementation. There are certainly others.
Say you have 32 registers. They are "memory mapped" into the first 32
addresses of memory.
Very PDP-6 of you.
But you really do not want code and data (registers) in the same page,
so just start at page[1].
But ?!? what do you do with NULL pointers ??? now you actually can
dereference them.
No, I think this brings more problems than solutions.
So programs would have to start not at zero, but
at 32 (I know this can cause other problems - I clearly have not thought
through all of the details.) So now when the CPU encounters a load (or
store) instruction where the virtual address is less than 32, it is
resolved not by the memory system, but by the appropriate register.
i.e. if the virtual address was say 4, the load would be from register
R4, not memory location 4.
Do you really want to have a LD access span registers ?!?
Yes, the virtual addressing mechanism
would have to be sensitive to whether the address was below 32 or not,
That is actually easy enough to do: VAS[HOB..5] == 0
but that is simple within the CPU. Note that the load instruction in
this case would not touch the memory system at all, so no cache lookups,
no TLB lookups, etc.
And how does the supposed
rareness help?
Lawrence said it would defeat the purpose of registers. My comment was
that since it would be rare, i.e. most of the register references would
be the same as before, it wouldn't defeat the purpose.
Remember the subject: You suggested this mechanism as a way to
eliminate the disadvantage of VVM compared to AVX-512 in 8x8 matrix
multiplication, and the disadvantage was that VVM cannot eliminate
some memory accesses that AVX-512 can. Turning the registers into
memory does not solve that, and probably incurs additional costs.
This cure is worse than the disease.
The register accesses are not turned into memory accesses. If the
address is less than 32, the instruction references the actual register,
LDD Rd,[31] spans R3 and R4
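
A short Python sketch of the arithmetic behind the two remarks in this post. It assumes byte addressing and 8-byte (64-bit) registers, which is what the LDD example appears to imply; the function names are hypothetical.

```python
REG_BYTES = 8  # assumed 64-bit registers


def is_register_address(addr):
    """True when all address bits from bit 5 upward are zero, i.e. addr < 32."""
    return (addr >> 5) == 0


def registers_touched(addr, size):
    """Register numbers covered by a `size`-byte access starting at byte `addr`."""
    first = addr // REG_BYTES
    last = (addr + size - 1) // REG_BYTES
    return list(range(first, last + 1))


touched = registers_touched(31, 8)  # an 8-byte load at byte address 31
```

Under these assumptions the load at address 31 starts in the last byte of R3 and runs into R4, so a misaligned access straddles two registers even though its starting address passes the window check.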
On 5/12/2026 11:12 AM, MitchAlsup wrote:
Let me give one possible implementation. There are certainly others.
Say you have 32 registers. They are "memory mapped" into the first 32
addresses of memory.
Very PDP-6 of you.
:-) I did say that others had implemented it, and other posters gave
several examples
But you really do not want code and data (registers) in the same page,
so just start at page[1].
Good Point.
But ?!? what do you do with NULL pointers ??? now you actually can
dereference them.
On Tue, 12 May 2026 18:12:51 GMT, MitchAlsup wrote:
Very PDP-6 of you.
But you really do not want code and data (registers) in the same
page, so just start at page[1].
But remember, in those days, impure code was considered a feature, not
a bug. ;)
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 5/12/2026 11:12 AM, MitchAlsup wrote:
Let me give one possible implementation. There are certainly others.
Say you have 32 registers. They are "memory mapped" into the first 32
addresses of memory.
Very PDP-6 of you.
:-) I did say that others had implemented it, and other posters gave
several examples
But you really do not want code and data (registers) in the same page,
so just start at page[1].
Good Point.
While true, most modern operating systems mark code pages read-only;
which would make it impossible to have writable data registers in
the same page as code anyway.
The B3500 (1964) had a number of reserved fields in the first 100 digits
of memory (stack pointer, three index registers,
indirect field lengths, SCAN instruction storage, an array of
'insert' characters for the EDT (edit - e.g. COBOL PIC clause
formatting) instruction, etc). Each program had 1 million digits of
memory, but due to the max offset of a branch instruction, code was
limited to the first 300,000 digits.
When we introduced a new memory management scheme in the early 1980s,
we introduced a segmented scheme, where segment zero was application
data (including the reserved fields in the first 100 digits) and
stack, while segment 1 was code (and read-only). The remaining
6 segments in an 'environment' were reserved for application data.
To maintain backward compatibility, the segments were limited to
1 megadigit in size. The processor supported up to 100,000
environments per task - where memory area zero was shared across
all environments. A non-local call instruction would switch
environments. FWIW, the stack grew towards higher addresses.
But ?!? what do you do with NULL pointers ??? now you actually can
dereference them.
On that system, a NULL pointer was encoded as the six digit
value @EEEEEE@, which, since it contained undigits, was otherwise
an invalid address. There were instructions to search
linked lists, and the NULL pointer value was honored by
the hardware during searches (setting a condition flag
when encountered by the instruction).
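
A toy Python model of a list search that honors a sentinel NULL link, in the spirit of the description above. It is not the B3500 instruction semantics: the dictionaries standing in for memory, the function name `search_list`, and the returned flag are assumptions made for illustration, and the sentinel is simply written as the hexadecimal constant 0xEEEEEE.

```python
NULL_LINK = 0xEEEEEE  # six "E" undigits, so never a valid decimal address


def search_list(links, values, head, target):
    """Follow links from `head`; return (address, end_of_list_flag).

    The flag plays the role of the condition flag set when the NULL
    link is encountered before a match is found.
    """
    addr = head
    while addr != NULL_LINK:
        if values[addr] == target:
            return addr, False
        addr = links[addr]
    return None, True


# Three nodes at decimal addresses 100, 200 and 300.
links = {100: 200, 200: 300, 300: NULL_LINK}
values = {100: "a", 200: "b", 300: "c"}

found = search_list(links, values, 100, "b")
missing = search_list(links, values, 100, "z")
```

Because the sentinel can never be a legitimate address, address zero stays usable for the reserved fields without making NULL dereferenceable.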
On 5/12/26 08:11, Stephen Fuld wrote:
Let me give one possible implementation. There are certainly others.
Say you have 32 registers. They are "memory mapped" into the first 32
addresses of memory. So programs would have to start not at zero, but
at 32 (I know this can cause other problems - I clearly have not
thought through all of the details.) So now when the CPU encounters a
load (or store) instruction where the virtual address is less than 32,
it is resolved not by the memory system, but by the appropriate
register. i.e. if the virtual address was say 4, the load would be
from register R4, not memory location 4. Yes, the virtual addressing
mechanism would have to be sensitive to whether the address was below
32 or not, but that is simple within the CPU. Note that the load
instruction in this case would not touch the memory system at all, so
no cache lookups, no TLB lookups, etc.
Why not look at the problem from another perspective?
Add e.g. 4KB (= 512 64bit "registers") ultra-fast SRAM to the core, accessible in 1 or 2 clocks, and two transfer instructions
ldqr Rd, <index>
stqr Rd, <index>
This should work out perfectly even in a tight VVM loop.
--
Bernd Linsel
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
I take a good portion of the blame for your over-emphasis on issues with
the load unit within the CPU and issues with using it for the indexed
register moves. I repeat, the instruction will almost certainly not be
implemented internally in the load/store functional unit(s).
If you encode it as load/store instruction, with the address (that the decoder knows nothing about) deciding whether it's a register or an
address, then it will be treated as load or store and scheduled for
the load/store unit.
but would
still suffer from many of the problems mentioned above. There are
reasons why no modern architecture has such instructions.
While I agree that no modern architecture has such instructions, neither
does any have VVM.
Sure, and maybe none ever will.
I am somewhat sceptical of auto-vectorization in all forms, but at
least VVM does not go totally
against the grain of modern high-performance CPU design. Indexed
access to registers does.
The question is whether the cost of such an implementation is justified
to improve VVM. To that question, I leave the answer to others.
My take is that it isn't.
- anton
On 5/13/26 14:02, Bernd Linsel wrote:
Add e.g. 4KB (= 512 64bit "registers") ultra-fast SRAM to the core, accessible in 1 or 2 clocks, and two transfer instructions
ldqr Rd, <index>
stqr Rd, <index>
This should work out perfectly even in a tight VVM loop.
Should of course read
ldqr Rd, Rs // Rs indexes into ultra-fast on-chip SRAM
stqr Rs1, Rs2 // Rs2 indexes into ultra-fast on-chip SRAM
I think "direct addressing" with an immediate index instead of via an
index in a register is not needed.
Lawrence D'Oliveiro <ldo@nz.invalid> posted:
On Tue, 12 May 2026 18:12:51 GMT, MitchAlsup wrote:
Very PDP-6 of you.
But you really do not want code and data (registers) in the same
page, so just start at page[1].
But remember, in those days, impure code was considered a feature, not
a bug. ;)
And we barely survived...
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
You have a load instruction in the scheduler. How
does it know when the "memory" location it is waiting for is ready?
The scheduler is only waiting for operand registers to be ready.
The register renamer knows nothing about the address that the
instruction accesses, because the address will usually come in at
run-time (otherwise one would use a direct register access). So
should every load and store wait for all physical registers to be
ready that corresponded to logical registers when the instruction came
through the register renamer? That would slow down all loads and
stores. If not, how else should it be done?
Only use it when you don't care about that stuff.
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
You have a load instruction in the scheduler. How
does it know when the "memory" location it is waiting for is ready?
The scheduler is only waiting for operand registers to be ready.
Let's assume that you have
A: load r1<-mem_location
B: load r5<-mem_location
C: load r2<-(r1)
D: add r5<-r5+1
and the architectural value loaded into r1 is the address of r5. What
"operand registers" should the instruction C wait for?
1) Should it wait for r1? If so, how does it know when it can access r5?
How does it know which physical register r5 lives in?
2) Should it wait for r1 and r5? How does it learn that it should
wait for r5?
3) Should it wait for r1, and when that is known and points to the
address of r5, enter a scheduler that waits for r5? Again, how does
it learn what the physical register for r5 is at C? And how is it
known that D is not the only user?
If you add storing to registers in indexed ways, additional issues
come into play:
A: load r1<-mem_location
B: load r5<-mem_location
C: store r2->(r1)
D: add r5<-r5+1
Now the renamer has to deal with the possibility that C may or may not
store to the architectural register that B and D access.
One solution for both the load and the store problems would be to
predict the accessed register (if any) and do a misprediction recovery
if the prediction turns out to be wrong.
- anton
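The load example above (A to D) can be illustrated with a toy register renamer in Python. This is only a sketch under assumptions: the `Renamer` class, the unlimited supply of physical registers and the map snapshot taken at C are hypothetical devices for showing the hazard, not a mechanism anyone in the thread proposed. It shows that once D has renamed r5, the current map no longer tells C which physical register held the value of r5 in program order.

```python
class Renamer:
    """Toy logical-to-physical register map (free list never runs out)."""

    def __init__(self, num_logical):
        # Initially logical register i lives in physical register i.
        self.map = list(range(num_logical))
        self.next_free = num_logical

    def rename_dest(self, logical):
        """Allocate a fresh physical register for a write to `logical`."""
        phys = self.next_free
        self.next_free += 1
        self.map[logical] = phys
        return phys

    def snapshot(self):
        """Copy of the map as seen by the instruction being renamed."""
        return list(self.map)

r = Renamer(8)
pA = r.rename_dest(1)   # A: load r1 <- mem_location
pB = r.rename_dest(5)   # B: load r5 <- mem_location
map_at_C = r.snapshot() # C: load r2 <- (r1), register index unknown here
pC = r.rename_dest(2)
pD = r.rename_dest(5)   # D: add r5 <- r5 + 1, r5 gets a new physical register

# Execution: assume r1 receives the index 5 ("address of r5") and B loads 100.
phys = {pA: 5, pB: 100}
phys[pD] = phys[pB] + 1          # D may well execute before C

index = phys[pA]                 # only now does C learn it wants r5
wrong = phys[r.map[index]]       # current map: D's result, younger than C
right = phys[map_at_C[index]]    # map in program order at C: B's result
```

Keeping a full map copy per indexed access is exactly the kind of cost the questions above are probing. The sketch also ignores the readiness problem: C would still have to wait for whichever physical register it ends up selecting.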
Bernd Linsel <bl1-thispartdoesnotbelonghere@gmx.com> posted:
On 5/13/26 14:02, Bernd Linsel wrote:
Add e.g. 4KB (= 512 64bit "registers") ultra-fast SRAM to the core,
accessible in 1 or 2 clocks, and two transfer instructions
ldqr Rd, <index>
stqr Rd, <index>
This should work out perfectly even in a tight VVM loop.
Should of course read
ldqr Rd, Rs // Rs indexes into ultra-fast on-chip SRAM
stqr Rs1, Rs2 // Rs2 indexes into ultra-fast on-chip SRAM
I think "direct addressing" with an immediate index instead of via an
index in a register is not needed.
How do you access a different register each loop iteration ???
if you don't have indexing ???
On 5/13/26 22:52, MitchAlsup wrote:
Bernd Linsel <bl1-thispartdoesnotbelonghere@gmx.com> posted:
On 5/13/26 14:02, Bernd Linsel wrote:
Add e.g. 4KB (= 512 64bit "registers") ultra-fast SRAM to the core,
accessible in 1 or 2 clocks, and two transfer instructions
ldqr Rd, <index>
stqr Rd, <index>
This should work out perfectly even in a tight VVM loop.
Should of course read
ldqr Rd, Rs     // Rs indexes into ultra-fast on-chip SRAM
stqr Rs1, Rs2   // Rs2 indexes into ultra-fast on-chip SRAM
I think "direct addressing" with an immediate index instead of via an
index in a register is not needed.
How do you access a different register each loop iteration ???
if you don't have indexing ???
It's meant as:
ld Rd, qregs[Rs] and
st Rs1, qregs[Rs2],
i.e. the second register serves as an index into the "quick regs" local SRAM bank.
Supporting only aligned full-word accesses should be sufficient, so that
these are really indices, not addresses.
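A small Python model of the two transfer instructions as described above, offered as a sketch only. The class name `QuickRegs`, the 32-entry general register list and the choice to raise an error on an out-of-range index are assumptions; the thread specifies none of them. The bank holds 512 words of 64 bits, and the second operand register supplies a word index rather than a byte address.

```python
MASK64 = (1 << 64) - 1

class QuickRegs:
    """Toy model of a 4KB bank of 512 x 64-bit "quick registers"."""

    def __init__(self, words=512):
        self.words = [0] * words

    def _check(self, index):
        # Assumed behaviour: an out-of-range index is an error.
        if not 0 <= index < len(self.words):
            raise IndexError(f"quick register index {index} out of range")
        return index

    def ldqr(self, regs, rd, rs):
        """ldqr Rd, Rs : Rd <- qregs[Rs]"""
        regs[rd] = self.words[self._check(regs[rs])]

    def stqr(self, regs, rs1, rs2):
        """stqr Rs1, Rs2 : qregs[Rs2] <- Rs1"""
        self.words[self._check(regs[rs2])] = regs[rs1] & MASK64

regs = [0] * 32          # ordinary general-purpose registers
q = QuickRegs()

# Fill qregs[0..3] with 0, 10, 20, 30; R2 is the index register.
for i in range(4):
    regs[1] = i * 10
    regs[2] = i
    q.stqr(regs, 1, 2)

# Sum them back; a different quick register is touched each iteration.
regs[3] = 0
for i in range(4):
    regs[2] = i
    q.ldqr(regs, 4, 2)
    regs[3] += regs[4]
```

Since the index lives in a register, each loop iteration can select a different entry, which is what the question about indexing was asking for.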
On 5/14/2026 3:03 PM, Bernd Linsel wrote:
On 5/13/26 22:52, MitchAlsup wrote:
Bernd Linsel <bl1-thispartdoesnotbelonghere@gmx.com> posted:
On 5/13/26 14:02, Bernd Linsel wrote:
Add e.g. 4KB (= 512 64bit "registers") ultra-fast SRAM to the core,
accessible in 1 or 2 clocks, and two transfer instructions
ldqr Rd, <index>
stqr Rd, <index>
This should work out perfectly even in a tight VVM loop.
Should of course read
ldqr Rd, Rs     // Rs indexes into ultra-fast on-chip SRAM
stqr Rs1, Rs2   // Rs2 indexes into ultra-fast on-chip SRAM
I think "direct addressing" with an immediate index instead of via an
index in a register is not needed.
How do you access a different register each loop iteration ???
if you don't have indexing ???
It's meant as:
ld Rd, qregs[Rs] and
st Rs1, qregs[Rs2],
i.e. the second register serves as an index into the "quick regs" local SRAM bank.
Supporting only aligned full-word accesses should be sufficient, so that
these are really indices, not addresses.
I must be missing something. Doesn't this quick regs memory have to be saved and restored on each context switch? If so, that is very expensive.
Stephen Fuld [2026-05-11 23:11:07] wrote:
Let me give one possible implementation. There are certainly others. Say
you have 32 registers. They are "memory mapped" into the first 32 addresses
of memory. So programs would have to start not at zero, but at 32 (I know
this can cause other problems - I clearly have not thought through all of
the details.) So now when the CPU encounters a load (or store) instruction
where the virtual address is less than 32, it is resolved not by the memory
system, but by the appropriate register. i.e. if the virtual address was say
4, the load would be from register R4, not memory location 4. Yes, the
virtual addressing mechanism would have to be sensitive to whether the
address was below 32 or not, but that is simple within the CPU. Note that
the load instruction in this case would not touch the memory system at all,
so no cache lookups, no TLB lookups, etc.
That solves the problem of encoding an indirect register access as
a LD/ST instruction, but I highly doubt that's the main problem
introduced by indirect register access.
It'd actually be easier to just add a new instruction for indirect
register access (no need to burden the load/store unit, no need to worry about access size and alignment, memory remapping, and whatnot).
The implementation problem, AFAIK comes in with OoO: by the time your instruction (whether a load or a dedicated instruction) gets to know
which register it needs to read, we're in the middle of the OoO engine,
and the first thing it needs to do is to figure out which physical
register corresponds to this logical register (and it needs to find out
also if that physical register's value has already been delivered).
The needed information is definitely out there somewhere in the CPU,
but I'm not sure it can be made available cheaply at that time&place.
Some GPUs offer such indirect register addressing, but they have a very different microarchitecture, much more like a barrel processor, with no
OoO execution in sight.
=== Stefan
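For reference, the quoted memory-mapped scheme can be sketched in Python as follows. This is an illustration under assumptions, not a description of any real CPU: the class `Cpu`, the word-granular addresses (one address per register) and the access counter are hypothetical. It only models the encoding side, that is, addresses below 32 resolve to the register file and never reach the memory system. It says nothing about the out-of-order renaming problem raised in the post above.

```python
NUM_REGS = 32

class Cpu:
    """Toy model: the first 32 virtual addresses alias the register file."""

    def __init__(self):
        self.regs = [0] * NUM_REGS
        self.memory = {}          # sparse stand-in for cache/TLB/DRAM
        self.memory_accesses = 0  # counts trips to the memory system

    def load(self, vaddr):
        if 0 <= vaddr < NUM_REGS:
            return self.regs[vaddr]      # resolved by the register file
        self.memory_accesses += 1
        return self.memory.get(vaddr, 0)

    def store(self, vaddr, value):
        if 0 <= vaddr < NUM_REGS:
            self.regs[vaddr] = value     # resolved by the register file
            return
        self.memory_accesses += 1
        self.memory[vaddr] = value

cpu = Cpu()
cpu.regs[4] = 99
from_register = cpu.load(4)        # virtual address 4 reads R4
accesses_after_reg = cpu.memory_accesses

cpu.store(1000, 7)                 # ordinary memory traffic
from_memory = cpu.load(1000)
```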
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
On 5/11/2026 7:17 PM, Lawrence D'Oliveiro wrote:
On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:
A possible alternative that I have seen is to "memory map" the
registers as an alternative accessing mechanism. This allows you to
"index" the registers, similarly to indexing a memory array.
Doesn't this defeat the point of how registers are supposed to work?
No. In the vast majority of cases, you reference registers as you do
now, with register numbers in assigned places in the instruction. But
you do have an "alternate" way of referencing them that allows you to
use an index, just as you can with memory. That mechanism would only be
used in rare circumstances.
Maybe one way to implement this would be to treat a special region,
like local variables addressed in a certain range relative to the
stack pointer or some other register, as something that the CPU
can treat as if it were a register, and include in its renaming.
Thomas Koenig <tkoenig@netcologne.de> writes:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
On 5/11/2026 7:17 PM, Lawrence D'Oliveiro wrote:
On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:
A possible alternative that I have seen is to "memory map" the
registers as an alternative accessing mechanism. This allows you to
"index" the registers, similarly to indexing a memory array.
Doesn't this defeat the point of how registers are supposed to work?
No. In the vast majority of cases, you reference registers as you do
now, with register numbers in assigned places in the instruction. But
you do have an "alternate" way of referencing them that allows you to
use an index, just as you can with memory. That mechanism would only be
used in rare circumstances.
Maybe one way to implement this would be to treat a special region,
like local variable addressed in a certain range relative to the
stack pointer or some other register as something that the CPU
can treat as if it were a register, and include in its renaming.
Such a region should probably be page-aligned and sized
to an integral multiple of the page size.
A certain portion
of the virtual address space could be then mapped to, for example,
a 4KB bank of high-speed SRAM.
Scott Lurndal <scott@slp53.sl.home> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
On 5/11/2026 7:17 PM, Lawrence D'Oliveiro wrote:
On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:
A possible alternative that I have seen is to "memory map" the
registers as an alternative accessing mechanism. This allows you to
"index" the registers, similarly to indexing a memory array.
Doesn't this defeat the point of how registers are supposed to work?
No. In the vast majority of cases, you reference registers as you do
now, with register numbers in assigned places in the instruction. But
you do have an "alternate" way of referencing them that allows you to
use an index, just as you can with memory. That mechanism would only be
used in rare circumstances.
Maybe one way to implement this would be to treat a special region,
like local variable addressed in a certain range relative to the
stack pointer or some other register as something that the CPU
can treat as if it were a register, and include in its renaming.
Such a region should probably be page-aligned and sized
to an integral multiple of the page size.
Agreed. A "local thread only" flag could then be set in a
page table.
A certain portion
of the virtual address space could be then mapped to, for example,
a 4KB bank of high-speed SRAM.
That could compete with cache, and still cause memory traffic.
I am not sure how this would compare with just loading the
values into the cache on the first iteration.
Thomas Koenig <tkoenig@netcologne.de> posted:
Scott Lurndal <scott@slp53.sl.home> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
On 5/11/2026 7:17 PM, Lawrence D'Oliveiro wrote:
On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:
A possible alternative that I have seen is to "memory map" the
registers as an alternative accessing mechanism. This allows you to
"index" the registers, similarly to indexing a memory array.
Doesn't this defeat the point of how registers are supposed to work?
No. In the vast majority of cases, you reference registers as you do
now, with register numbers in assigned places in the instruction. But
you do have an "alternate" way of referencing them that allows you to
use an index, just as you can with memory. That mechanism would only be
used in rare circumstances.
Maybe one way to implement this would be to treat a special region,
like local variable addressed in a certain range relative to the
stack pointer or some other register as something that the CPU
can treat as if it were a register, and include in its renaming.
Such a region should probably be page-aligned and sized
to an integral multiple of the page size.
Agreed. A "local thread only" flag could then be set in a
page table.
That would prevent thread[k] from allowing thread[j] access to its
thread local store via shared pointer.
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
Thomas Koenig <tkoenig@netcologne.de> posted:
Scott Lurndal <scott@slp53.sl.home> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
On 5/11/2026 7:17 PM, Lawrence D'Oliveiro wrote:
On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:
A possible alternative that I have seen is to "memory map" the
registers as an alternative accessing mechanism. This allows you to
"index" the registers, similarly to indexing a memory array.
Doesn't this defeat the point of how registers are supposed to work?
No. In the vast majority of cases, you reference registers as you do
now, with register numbers in assigned places in the instruction. But
you do have an "alternate" way of referencing them that allows you to
use an index, just as you can with memory. That mechanism would only be
used in rare circumstances.
Maybe one way to implement this would be to treat a special region,
like local variable addressed in a certain range relative to the
stack pointer or some other register as something that the CPU
can treat as if it were a register, and include in its renaming.
Such a region should probably be page-aligned and sized
to an integral multiple of the page size.
Agreed. A "local thread only" flag could then be set in a
page table.
That would prevent thread[k] from allowing thread[j] access to its
thread local store via shared pointer.
Not for all of the thread's memory, I was thinking of this as a
separate flag, to be set only for special purposes (such as above).
Thomas Koenig <tkoenig@netcologne.de> posted:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
Thomas Koenig <tkoenig@netcologne.de> posted:
Scott Lurndal <scott@slp53.sl.home> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
On 5/11/2026 7:17 PM, Lawrence D'Oliveiro wrote:
On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:
A possible alternative that I have seen is to "memory map" the
registers as an alternative accessing mechanism. This allows you to
"index" the registers, similarly to indexing a memory array.
Doesn't this defeat the point of how registers are supposed to work?
No. In the vast majority of cases, you reference registers as you do
now, with register numbers in assigned places in the instruction. But
you do have an "alternate" way of referencing them that allows you to
use an index, just as you can with memory. That mechanism would only be
used in rare circumstances.
Maybe one way to implement this would be to treat a special region,
like local variable addressed in a certain range relative to the
stack pointer or some other register as something that the CPU
can treat as if it were a register, and include in its renaming.
Such a region should probably be page-aligned and sized
to an integral multiple of the page size.
Agreed. A "local thread only" flag could then be set in a
page table.
That would prevent thread[k] from allowing thread[j] access to its
thread local store via shared pointer.
Not for all of the thread's memory, I was thinking of this as a
separate flag, to be set only for special purposes (such as above).
How does one {programmer or OS} glean that the bit can be set ??
Thomas Koenig <tkoenig@netcologne.de> writes:
Scott Lurndal <scott@slp53.sl.home> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
On 5/11/2026 7:17 PM, Lawrence D'Oliveiro wrote:
On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:
A possible alternative that I have seen is to "memory map" the
registers as an alternative accessing mechanism. This allows you to
"index" the registers, similarly to indexing a memory array.
Doesn't this defeat the point of how registers are supposed to work?
No. In the vast majority of cases, you reference registers as you do
now, with register numbers in assigned places in the instruction. But
you do have an "alternate" way of referencing them that allows you to
use an index, just as you can with memory. That mechanism would only be
used in rare circumstances.
Maybe one way to implement this would be to treat a special region,
like local variable addressed in a certain range relative to the
stack pointer or some other register as something that the CPU
can treat as if it were a register, and include in its renaming.
Such a region should probably be page-aligned and sized
to an integral multiple of the page size.
Agreed. A "local thread only" flag could then be set in a
page table.
A certain portion
of the virtual address space could be then mapped to, for example,
a 4KB bank of high-speed SRAM.
That could compete with cache, and still cause memory traffic.
The OS can designate that page as 'noncacheable', so no
coherency traffic necessary. It would simply be a faster
page of memory, with access times closer to cache than DRAM
and shared by multiple cores (with appropriate software care).
Scott Lurndal <scott@slp53.sl.home> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
Scott Lurndal <scott@slp53.sl.home> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
On 5/11/2026 7:17 PM, Lawrence D'Oliveiro wrote:
On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:
A possible alternative that I have seen is to "memory map" the
registers as an alternative accessing mechanism. This allows you to
"index" the registers, similarly to indexing a memory array.
Doesn't this defeat the point of how registers are supposed to work?
No. In the vast majority of cases, you reference registers as you do
now, with register numbers in assigned places in the instruction. But
you do have an "alternate" way of referencing them that allows you to
use an index, just as you can with memory. That mechanism would only be
used in rare circumstances.
Maybe one way to implement this would be to treat a special region,
like local variable addressed in a certain range relative to the
stack pointer or some other register as something that the CPU
can treat as if it were a register, and include in its renaming.
Such a region should probably be page-aligned and sized
to an integral multiple of the page size.
Agreed. A "local thread only" flag could then be set in a
page table.
A certain portion
of the virtual address space could be then mapped to, for example,
a 4KB bank of high-speed SRAM.
That could compete with cache, and still cause memory traffic.
The OS can designate that page as 'noncacheable', so no
coherency traffic necessary.
It would simply be a faster
page of memory, with access times closer to cache than DRAM
and shared by multiple cores (with appropriate software care).
That is of course a possibility.
On 5/2/2026 11:46 AM, MitchAlsup wrote:
big snip
Thomas Koenig <tkoenig@netcologne.de> posted:
One problem I see is memory traffic. In the SIMD version, A is
loaded once at the beginning of the loop. Here, it is loaded N**2
times, with different offsets each VVM iteration, vs only once
for the AVX512 version. Also, C is loaded and stored N**2 times,
vs. only once. (The AVX version also loads B only once).
The LDD using R6 as an index can be hoisted into Loop2 prologue.
{I did miss that}.
With a 3-cycle LDD and a 4-cycle FMAC and 3 LD ports the depth of
the loop is 6-cycles, so the 8-wide machine would run the loop in
8-cycles of latency.
Plus, the setup time for VVM...
I have been thinking about this overnight and may have a solution
that alters only the VEC instruction.
Any progress?
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:
On 5/14/2026 3:03 PM, Bernd Linsel wrote:
On 5/13/26 22:52, MitchAlsup wrote:
Bernd Linsel <bl1-thispartdoesnotbelonghere@gmx.com> posted:
On 5/13/26 14:02, Bernd Linsel wrote:
Add e.g. 4KB (= 512 64bit "registers") ultra-fast SRAM to the core,
accessible in 1 or 2 clocks, and two transfer instructions
ldqr Rd, <index>
stqr Rd, <index>
This should work out perfectly even in a tight VVM loop.
Should of course read
ldqr Rd, Rs     // Rs indexes into ultra-fast on-chip SRAM
stqr Rs1, Rs2   // Rs2 indexes into ultra-fast on-chip SRAM
I think "direct addressing" with an immediate index instead of via an
index in a register is not needed.
How do you access a different register each loop iteration ???
if you don't have indexing ???
It's meant as:
ld Rd, qregs[Rs] and
st Rs1, qregs[Rs2],
OK, that solves the indexing issue.
i.e. the second register serves as an index into the "quick regs" local SRAM bank.
Supporting only aligned full-word accesses should be sufficient, so that
these are really indices, not addresses.
I must be missing something. Doesn't this quick regs memory have to be
saved and restored on each context switch? If so, that is very expensive.
qregs[] is (IS) the actual register file (or files)--so, no added state.