peter <peter.noreply@tin.it> writes:
I did a test coding the sum128 as a code word with avx-512 instructions
and got the following results
285,584,376 cycles:u
941,856,077 instructions:u
timing was
timer-reset ' recursive-sum bench .elapsed 51 ms elapsed
so half the time of the original recursive.
with 32 zmm registers I could have done a sum256 also
One could do sum128 with just 8 registers by performing the adds ASAP,
i.e., for sum32
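; sum32: 32 doubles = 4 loads of 8 doubles each; each pair is added as
; soon as both operands have been loaded, so only 3 zmm registers are live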
vmovapd zmm0, [rbx]
vmovapd zmm1, [rbx+64]
vaddpd zmm0, zmm0, zmm1
vmovapd zmm1, [rbx+128]
vmovapd zmm2, [rbx+192]
vaddpd zmm1, zmm1, zmm2
vaddpd zmm0, zmm0, zmm1
; and then the Horizontal sum
And you can code this as:
vmovapd zmm0, [rbx]
vaddpd zmm0, zmm0, [rbx+64]
vmovapd zmm1, [rbx+128]
vaddpd zmm1, zmm1, [rbx+192]
vaddpd zmm0, zmm0, zmm1
; and then the Horizontal sum
; Horizontal sum of zmm0
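; (reduces the 8 doubles in zmm0 to their scalar sum in xmm0: 8 -> 4 -> 2 -> 1)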
vextractf64x4 ymm1, zmm0, 1
vaddpd ymm2, ymm1, ymm0
vextractf64x2 xmm3, ymm2, 1
vaddpd ymm4, ymm3, ymm2
vhaddpd xmm0, xmm4, xmm4
Instead of doing the horizontal sum once for every sum128, it might be
more efficient (assuming the whole thing is not cache-bandwidth-limited)
to have the result of sum128 be a full SIMD width, then add those
results with vaddpd instead of addsd, and do the horizontal sum only
once at the end.
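A minimal sketch of that variant, assuming the full-width results of
four sum128 calls have been left in zmm0-zmm3 (that register
assignment is only for illustration):
; combine the full SIMD-width partial results with vaddpd, not addsd
vaddpd zmm0, zmm0, zmm1
vaddpd zmm2, zmm2, zmm3
vaddpd zmm0, zmm0, zmm2
; Horizontal sum of zmm0 (as above), done only once at the very end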
But if the recursive part is to be programmed in Forth, we would need
a way to represent a SIMD width of data in Forth, maybe with a SIMD
stack. I see a few problems there:
* What to do about the mask registers of AVX-512? In the RISC-V
vector extension masks are stored in regular SIMD registers. (A
minimal masked-add example follows after this list.)
* There is a trend visible in ARM SVE and the RISC-V Vector extension
to have support for dealing with loops across longer vectors. Do we
also need to support something like that?
For the RISC-V vector extension, see <https://riscv.org/wp-content/uploads/2024/12/15.20-15.55-18.05.06.VEXT-bcn-v1.pdf>
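As a reminder of what the mask registers do, a minimal masked-add
sketch (k1 and the register choice here are only for illustration):
mov eax, 0x0f
kmovw k1, eax                  ; mask selects the low 4 of the 8 double lanes
vaddpd zmm0{k1}, zmm1, zmm2    ; masked lanes are updated, the rest of zmm0 is kept
The question above is then how such a mask value would be represented
on a Forth SIMD stack.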
One way to deal with all that would be to have a long-vector stack and
have something like my vector wordset
<https://github.com/AntonErtl/vectors>, where the sum of a vector
would be a word that is implemented in some lower-level way (e.g.,
assembly language); the sum of a vector is actually a planned, but not
yet existing feature of this wordset.
An advantage of having a (short) SIMD stack would be that one could
use SIMD operations for other uses where the long-vector wordset looks
too heavy-weight (or would need optimizations to get rid of the
long-vector overhead). The question is whether enough such uses exist
to justify adding such a stack.
- anton
On Sat, 19 Jul 2025 10:18:15 GMT, anton@mips.complang.tuwien.ac.at
(Anton Ertl) wrote:
vmovapd zmm0, [rbx]
vaddpd zmm0, zmm0, [rbx+64]
vmovapd zmm1, [rbx+128]
vaddpd zmm1, zmm1, [rbx+192]
vaddpd zmm0, zmm0, zmm1
; and then the Horizontal sum
; Horizontal sum of zmm0
vextractf64x4 ymm1, zmm0, 1
vaddpd ymm2, ymm1, ymm0
vextractf64x2 xmm3, ymm2, 1
vaddpd ymm4, ymm3, ymm2
vhaddpd xmm0, xmm4, xmm4
The SIMD instructions can also take a memory operand, so I can do
sum128 as:
code asum128b
movsd [r13-0x8], xmm0          ; push xmm0 (presumably the FP TOS) onto the memory FP stack
lea r13, [r13-0x8]             ; r13 appears to serve as the FP stack pointer
vmovapd zmm0, [rbx]            ; rbx points to the 128 doubles (16 x 64 bytes)
vaddpd zmm0, zmm0, [rbx+64]    ; accumulate with memory operands, one dependency chain
vaddpd zmm0, zmm0, [rbx+128]
vaddpd zmm0, zmm0, [rbx+192]
vaddpd zmm0, zmm0, [rbx+256]
vaddpd zmm0, zmm0, [rbx+320]
vaddpd zmm0, zmm0, [rbx+384]
vaddpd zmm0, zmm0, [rbx+448]
vaddpd zmm0, zmm0, [rbx+512]
vaddpd zmm0, zmm0, [rbx+576]
vaddpd zmm0, zmm0, [rbx+640]
vaddpd zmm0, zmm0, [rbx+704]
vaddpd zmm0, zmm0, [rbx+768]
vaddpd zmm0, zmm0, [rbx+832]
vaddpd zmm0, zmm0, [rbx+896]
vaddpd zmm0, zmm0, [rbx+960]
; (the horizontal sum of zmm0 into xmm0, as above, would follow to
;  leave the scalar result on the FP TOS)
On 19.07.2025 at 12:18, Anton Ertl wrote:
One way to deal with all that would be to have a long-vector stack and
have something like my vector wordset
<https://github.com/AntonErtl/vectors>, where the sum of a vector
would be a word that is implemented in some lower-level way (e.g.,
assembly language); the sum of a vector is actually a planned, but not
yet existing feature of this wordset.
Not wanting to sound negative, but who in practice adds up long
vectors, apart from testing compilers and fp-arithmetic?
Dot products, on the other hand, are fundamental for many linear
algebra algorithms, e.g. matrix multiplication and AI.
minforth <minforth@gmx.net> writes:
On 19.07.2025 at 12:18, Anton Ertl wrote:
One way to deal with all that would be to have a long-vector stack and
have something like my vector wordset
<https://github.com/AntonErtl/vectors>, where the sum of a vector
would be a word that is implemented in some lower-level way (e.g.,
assembly language); the sum of a vector is actually a planned, but not
yet existing feature of this wordset.
Not wanting to sound negative, but who in practice adds up long
vectors, apart from testing compilers and fp-arithmetic?
Everyone who does dot-products.
Dot products, on the other hand, are fundamental for many linear
algebra algorithms, e.g. matrix multiplication and AI.
If I add a vector-sum word
df+red ( dfv -- r )
\ r is the sum of the elements of dfv
to the vector wordset, then the dot-product is:
: dot-product ( dfv1 dfv2 -- r )
df*v df+red ;
Concerning matrix multiplication, while you can use the dot-product
for it, there are many other ways to do it, and some are more
efficient (although, admittedly, I have not used pairwise addition for
these ways).