• Vector sum (was: Parsing timestamps?)

    From anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Jul 19 10:18:15 2025
    From Newsgroup: comp.lang.forth

    peter <peter.noreply@tin.it> writes:
    I did a test coding the sum128 as a code word with avx-512 instructions
    and got the following results

    285,584,376 cycles:u
    941,856,077 instructions:u

    timing was
    timer-reset ' recursive-sum bench .elapsed 51 ms elapsed

    so half the time of the original recursive version.
    With 32 zmm registers I could have done a sum256 as well.

    One could do sum128 with just 8 registers by performing the adds ASAP,
    i.e., for sum32

    vmovapd zmm0, [rbx]
    vmovapd zmm1, [rbx+64]
    vaddpd zmm0, zmm0, zmm1
    vmovapd zmm1, [rbx+128]
    vmovapd zmm2, [rbx+192]
    vaddpd zmm1, zmm1, zmm2
    vaddpd zmm0, zmm0, zmm1
    ; and then the Horizontal sum

    And you can code this as:

    vmovapd zmm0, [rbx]
    vaddpd zmm0, zmm0, [rbx+64]
    vmovapd zmm1, [rbx+128]
    vaddpd zmm1, zmm1, [rbx+192]
    vaddpd zmm0, zmm0, zmm1
    ; and then the Horizontal sum

    ; Horizontal sum of zmm0

    vextractf64x4 ymm1, zmm0, 1
    vaddpd ymm2, ymm1, ymm0

    vextractf64x2 xmm3, ymm2, 1
    vaddpd ymm4, ymm3, ymm2

    vhaddpd xmm0, xmm4, xmm4

    Instead of doing the horizontal sum once for every sum128, it might be
    more efficient (assuming the whole thing is not
    cache-bandwidth-limited) to have the result of sum128 be a full SIMD
    width, and then add them up with vaddpd instead of addsd, and do the
    horizontal sum once in the end.

    But if the recursive part is to be programmed in Forth, we would need
    a way to represent a SIMD width of data in Forth, maybe with a SIMD
    stack. I see a few problems there:

    * What to do about the mask registers of AVX-512? In the RISC-V
    vector extension masks are stored in regular SIMD registers.

    * There is a trend visible in ARM SVE and the RISC-V Vector extension
    to have support for dealing with loops across longer vectors. Do we
    also need to support something like that?

    For the RISC-V vector extension, see <https://riscv.org/wp-content/uploads/2024/12/15.20-15.55-18.05.06.VEXT-bcn-v1.pdf>

    One way to deal with all that would be to have a long-vector stack and
    have something like my vector wordset
    <https://github.com/AntonErtl/vectors>, where the sum of a vector
    would be a word that is implemented in some lower-level way (e.g.,
    assembly language); the sum of a vector is actually a planned, but not
    yet existing feature of this wordset.

    An advantage of having a (short) SIMD stack would be that one could
    use SIMD operations for other uses where the long-vector wordset looks
    too heavy-weight (or would need optimizations to get rid of the
    long-vector overhead). The question is whether enough such uses exist to
    justify adding such a stack.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From minforth@gmx.net to comp.lang.forth on Sat Jul 19 13:53:20 2025
    From Newsgroup: comp.lang.forth

    Am 19.07.2025 um 12:18 schrieb Anton Ertl:

    One way to deal with all that would be to have a long-vector stack and
    have something like my vector wordset
    <https://github.com/AntonErtl/vectors>, where the sum of a vector
    would be a word that is implemented in some lower-level way (e.g.,
    assembly language); the sum of a vector is actually a planned, but not
    yet existing feature of this wordset.


    Not wanting to sound negative, but who in practice adds up long
    vectors, apart from testing compilers and fp-arithmetic?

    Dot products, on the other hand, are fundamental for many linear
    algebra algorithms, e.g. matrix multiplication and AI.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From peter.noreply@tin.it to comp.lang.forth on Sat Jul 19 15:24:48 2025
    From Newsgroup: comp.lang.forth

    On Sat, 19 Jul 2025 10:18:15 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    peter <peter.noreply@tin.it> writes:
    I did a test coding the sum128 as a code word with avx-512 instructions
    and got the following results

    285,584,376 cycles:u
    941,856,077 instructions:u

    timing was
    timer-reset ' recursive-sum bench .elapsed 51 ms elapsed

    so half the time of the original recursive version.
    With 32 zmm registers I could have done a sum256 as well.

    One could do sum128 with just 8 registers by performing the adds ASAP,
    i.e., for sum32

    vmovapd zmm0, [rbx]
    vmovapd zmm1, [rbx+64]
    vaddpd zmm0, zmm0, zmm1
    vmovapd zmm1, [rbx+128]
    vmovapd zmm2, [rbx+192]
    vaddpd zmm1, zmm1, zmm2
    vaddpd zmm0, zmm0, zmm1
    ; and then the Horizontal sum

    And you can code this as:

    vmovapd zmm0, [rbx]
    vaddpd zmm0, zmm0, [rbx+64]
    vmovapd zmm1, [rbx+128]
    vaddpd zmm1, zmm1, [rbx+192]
    vaddpd zmm0, zmm0, zmm1
    ; and then the Horizontal sum

    ; Horizontal sum of zmm0

    vextractf64x4 ymm1, zmm0, 1
    vaddpd ymm2, ymm1, ymm0

    vextractf64x2 xmm3, ymm2, 1
    vaddpd ymm4, ymm3, ymm2

    vhaddpd xmm0, xmm4, xmm4

    the SIMD instructions also take a memory operand,
    so I can do sum128 as

    code asum128b

    movsd [r13-0x8], xmm0
    lea r13, [r13-0x8]

    vmovapd zmm0, [rbx]
    vaddpd zmm0, zmm0, [rbx+64]
    vaddpd zmm0, zmm0, [rbx+128]
    vaddpd zmm0, zmm0, [rbx+192]
    vaddpd zmm0, zmm0, [rbx+256]
    vaddpd zmm0, zmm0, [rbx+320]
    vaddpd zmm0, zmm0, [rbx+384]
    vaddpd zmm0, zmm0, [rbx+448]
    vaddpd zmm0, zmm0, [rbx+512]
    vaddpd zmm0, zmm0, [rbx+576]
    vaddpd zmm0, zmm0, [rbx+640]
    vaddpd zmm0, zmm0, [rbx+704]
    vaddpd zmm0, zmm0, [rbx+768]
    vaddpd zmm0, zmm0, [rbx+832]
    vaddpd zmm0, zmm0, [rbx+896]
    vaddpd zmm0, zmm0, [rbx+960]


    ; Horizontal sum of zmm0

    vextractf64x4 ymm1, zmm0, 1
    vaddpd ymm2, ymm1, ymm0

    vextractf64x2 xmm3, ymm2, 1
    vaddpd ymm4, ymm3, ymm2

    vpermilpd xmm5, xmm4, 1
    vaddsd xmm0, xmm4, xmm5


    ret
    end-code

    this compiles to 154 bytes and 25 instructions
    The original sum128 is 2157 bytes and 513 instructions!

    Yes, the horizontal sum should just be done once.
    I have only replaced sum128 with SIMD as a test.
    Later I will do a complete example.

    This asum128b does not change the timing but reduces
    the number of instructions

    277,333,790 cycles:u
    834,846,183 instructions:u # 3.01 insn per cycle



    Instead of doing the horizontal sum once for every sum128, it might be
    more efficient (assuming the whole thing is not
    cache-bandwidth-limited) to have the result of sum128 be a full SIMD
    width, and then add them up with vaddpd instead of addsd, and do the horizontal sum once in the end.

    But if the recursive part is to be programmed in Forth, we would need
    a way to represent a SIMD width of data in Forth, maybe with a SIMD
    stack. I see a few problems there:

    * What to do about the mask registers of AVX-512? In the RISC-V
    vector extension masks are stored in regular SIMD registers.

    * There is a trend visible in ARM SVE and the RISC-V Vector extension
    to have support for dealing with loops across longer vectors. Do we
    also need to support something like that?

    For the RISC-V vector extension, see <https://riscv.org/wp-content/uploads/2024/12/15.20-15.55-18.05.06.VEXT-bcn-v1.pdf>

    One way to deal with all that would be to have a long-vector stack and
    have something like my vector wordset
    <https://github.com/AntonErtl/vectors>, where the sum of a vector
    would be a word that is implemented in some lower-level way (e.g.,
    assembly language); the sum of a vector is actually a planned, but not
    yet existing feature of this wordset.

    An advantage of having a (short) SIMD stack would be that one could
    use SIMD operations for other uses where the long-vector wordset looks
    too heavy-weight (or would need optimizations to get rid of the
    long-vector overhead). The question is whether enough such uses exist to
    justify adding such a stack.

    - anton

    I will take a look at your vector implementation and see if it can be used
    in lxf64

    BR
    Peter

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Jul 19 14:39:42 2025
    From Newsgroup: comp.lang.forth

    peter <peter.noreply@tin.it> writes:
    On Sat, 19 Jul 2025 10:18:15 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    [sum32]
    vmovapd zmm0, [rbx]
    vaddpd zmm0, zmm0, [rbx+64]
    vmovapd zmm1, [rbx+128]
    vaddpd zmm1, zmm1, [rbx+192]
    vaddpd zmm0, zmm0, zmm1
    ; and then the Horizontal sum

    ; Horizontal sum of zmm0

    vextractf64x4 ymm1, zmm0, 1
    vaddpd ymm2, ymm1, ymm0

    vextractf64x2 xmm3, ymm2, 1
    vaddpd ymm4, ymm3, ymm2

    vhaddpd xmm0, xmm4, xmm4

    the SIMD instructions also take a memory operand,
    so I can do sum128 as

    code asum128b

    movsd [r13-0x8], xmm0
    lea r13, [r13-0x8]

    vmovapd zmm0, [rbx]
    vaddpd zmm0, zmm0, [rbx+64]
    vaddpd zmm0, zmm0, [rbx+128]
    vaddpd zmm0, zmm0, [rbx+192]
    vaddpd zmm0, zmm0, [rbx+256]
    vaddpd zmm0, zmm0, [rbx+320]
    vaddpd zmm0, zmm0, [rbx+384]
    vaddpd zmm0, zmm0, [rbx+448]
    vaddpd zmm0, zmm0, [rbx+512]
    vaddpd zmm0, zmm0, [rbx+576]
    vaddpd zmm0, zmm0, [rbx+640]
    vaddpd zmm0, zmm0, [rbx+704]
    vaddpd zmm0, zmm0, [rbx+768]
    vaddpd zmm0, zmm0, [rbx+832]
    vaddpd zmm0, zmm0, [rbx+896]
    vaddpd zmm0, zmm0, [rbx+960]

    Yes, but that's not pairwise addition, so for these 16 adds you get
    worse average accuracy; if the CPU has limited OoO buffering (maybe
    one of the Xeon Phis, but not anything modern that has AVX or
    AVX-512), you may also see some of the addition latency. You still
    get pairwise addition and its accuracy benefit for the horizontal sum
    and the recursive parts.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Jul 19 14:51:00 2025
    From Newsgroup: comp.lang.forth

    minforth <minforth@gmx.net> writes:
    Am 19.07.2025 um 12:18 schrieb Anton Ertl:

    One way to deal with all that would be to have a long-vector stack and
    have something like my vector wordset
    <https://github.com/AntonErtl/vectors>, where the sum of a vector
    would be a word that is implemented in some lower-level way (e.g.,
    assembly language); the sum of a vector is actually a planned, but not
    yet existing feature of this wordset.


    Not wanting to sound negative, but who in practice adds up long
    vectors, apart from testing compilers and fp-arithmetic?

    Everyone who does dot-products.

    Dot products, on the other hand, are fundamental for many linear
    algebra algorithms, e.g. matrix multiplication and AI.

    If I add a vector-sum word

    df+red ( dfv -- r )
    \ r is the sum of the elements of dfv

    to the vector wordset, then the dot-product is:

    : dot-product ( dfv1 dfv2 -- r )
    df*v df+red ;

    Concerning matrix multiplication, while you can use the dot-product
    for it, there are many other ways to do it, and some are more
    efficient (although, admittedly, I have not used pairwise addition for
    these ways).

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From minforth@gmx.net to comp.lang.forth on Sun Jul 20 06:27:53 2025
    From Newsgroup: comp.lang.forth

    Am 19.07.2025 um 16:51 schrieb Anton Ertl:
    minforth <minforth@gmx.net> writes:
    Am 19.07.2025 um 12:18 schrieb Anton Ertl:

    One way to deal with all that would be to have a long-vector stack and
    have something like my vector wordset
    <https://github.com/AntonErtl/vectors>, where the sum of a vector
    would be a word that is implemented in some lower-level way (e.g.,
    assembly language); the sum of a vector is actually a planned, but not
    yet existing feature of this wordset.


    Not wanting to sound negative, but who in practice adds up long
    vectors, apart from testing compilers and fp-arithmetic?

    Everyone who does dot-products.

    Dot products, on the other hand, are fundamental for many linear
    algebra algorithms, e.g. matrix multiplication and AI.

    If I add a vector-sum word

    df+red ( dfv -- r )
    \ r is the sum of the elements of dfv

    to the vector wordset, then the dot-product is:

    : dot-product ( dfv1 dfv2 -- r )
    df*v df+red ;

    Sure, slow hand is not just for guitar players.
    With FMA, one could traverse the vectors in one go.

    https://docs.nvidia.com/cuda/floating-point/index.html

    Concerning matrix multiplication, while you can use the dot-product
    for it, there are many other ways to do it, and some are more
    efficient (although, admittedly, I have not used pairwise addition for
    these ways).

    There are tons of algorithms depending on various matrix properties.

    Then, given a desktop and a fat CPU, LAPACK et al. are your friends.
    An embedded or special CPU is a different story.

    --- Synchronet 3.21a-Linux NewsLink 1.2