• Vector sum (was: Parsing timestamps?)

    From anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Jul 19 10:18:15 2025
    From Newsgroup: comp.lang.forth

    peter <peter.noreply@tin.it> writes:
    I did a test coding the sum128 as a code word with avx-512 instructions
    and got the following results

    285,584,376 cycles:u
    941,856,077 instructions:u

    timing was
    timer-reset ' recursive-sum bench .elapsed 51 ms elapsed

    so half the time of the original recursive version.
    With 32 zmm registers I could have done a sum256 as well.

    One could do sum128 with just 8 registers by performing the adds ASAP,
    i.e., for sum32

    vmovapd zmm0, [rbx]
    vmovapd zmm1, [rbx+64]
    vaddpd zmm0, zmm0, zmm1
    vmovapd zmm1, [rbx+128]
    vmovapd zmm2, [rbx+192]
    vaddpd zmm1, zmm1, zmm2
    vaddpd zmm0, zmm0, zmm1
    ; and then the Horizontal sum

    And you can code this as:

    vmovapd zmm0, [rbx]
    vaddpd zmm0, zmm0, [rbx+64]
    vmovapd zmm1, [rbx+128]
    vaddpd zmm1, zmm1, [rbx+192]
    vaddpd zmm0, zmm0, zmm1
    ; and then the Horizontal sum

    ; Horizontal sum of zmm0

    vextractf64x4 ymm1, zmm0, 1
    vaddpd ymm2, ymm1, ymm0

    vextractf64x2 xmm3, ymm2, 1
    vaddpd ymm4, ymm3, ymm2

    vhaddpd xmm0, xmm4, xmm4

    Instead of doing the horizontal sum once for every sum128, it might be
    more efficient (assuming the whole thing is not
    cache-bandwidth-limited) to have the result of sum128 be a full SIMD
    width, and then add them up with vaddpd instead of addsd, and do the
    horizontal sum once in the end.

    But if the recursive part is to be programmed in Forth, we would need
    a way to represent a SIMD width of data in Forth, maybe with a SIMD
    stack. I see a few problems there:

    * What to do about the mask registers of AVX-512? In the RISC-V
    vector extension masks are stored in regular SIMD registers.

    * There is a trend visible in ARM SVE and the RISC-V Vector extension
    to have support for dealing with loops across longer vectors. Do we
    also need to support something like that?

    For the RISC-V vector extension, see <https://riscv.org/wp-content/uploads/2024/12/15.20-15.55-18.05.06.VEXT-bcn-v1.pdf>

    One way to deal with all that would be to have a long-vector stack and
    have something like my vector wordset
    <https://github.com/AntonErtl/vectors>, where the sum of a vector
    would be a word that is implemented in some lower-level way (e.g.,
    assembly language); the sum of a vector is actually a planned, but not
    yet existing feature of this wordset.

    An advantage of having a (short) SIMD stack would be that one could
    use SIMD operations for other uses where the long-vector wordset looks
    too heavy-weight (or would need optimizations to get rid of the
    long-vector overhead). The question is whether enough such uses exist to
    justify adding such a stack.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From minforth@gmx.net to comp.lang.forth on Sat Jul 19 13:53:20 2025
    From Newsgroup: comp.lang.forth

    Am 19.07.2025 um 12:18 schrieb Anton Ertl:

    One way to deal with all that would be to have a long-vector stack and
    have something like my vector wordset
    <https://github.com/AntonErtl/vectors>, where the sum of a vector
    would be a word that is implemented in some lower-level way (e.g.,
    assembly language); the sum of a vector is actually a planned, but not
    yet existing feature of this wordset.


    Not wanting to sound negative, but who in practice adds up long
    vectors, apart from testing compilers and fp-arithmetic?

    Dot products, on the other hand, are fundamental for many linear
    algebra algorithms, e.g. matrix multiplication and AI.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From peter.noreply@tin.it to comp.lang.forth on Sat Jul 19 15:24:48 2025
    From Newsgroup: comp.lang.forth

    On Sat, 19 Jul 2025 10:18:15 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    peter <peter.noreply@tin.it> writes:
    I did a test coding the sum128 as a code word with avx-512 instructions
    and got the following results

    285,584,376 cycles:u
    941,856,077 instructions:u

    timing was
    timer-reset ' recursive-sum bench .elapsed 51 ms elapsed

    so half the time of the original recursive version.
    With 32 zmm registers I could have done a sum256 as well.

    One could do sum128 with just 8 registers by performing the adds ASAP,
    i.e., for sum32

    vmovapd zmm0, [rbx]
    vmovapd zmm1, [rbx+64]
    vaddpd zmm0, zmm0, zmm1
    vmovapd zmm1, [rbx+128]
    vmovapd zmm2, [rbx+192]
    vaddpd zmm1, zmm1, zmm2
    vaddpd zmm0, zmm0, zmm1
    ; and then the Horizontal sum

    And you can code this as:

    vmovapd zmm0, [rbx]
    vaddpd zmm0, zmm0, [rbx+64]
    vmovapd zmm1, [rbx+128]
    vaddpd zmm1, zmm1, [rbx+192]
    vaddpd zmm0, zmm0, zmm1
    ; and then the Horizontal sum

    ; Horizontal sum of zmm0

    vextractf64x4 ymm1, zmm0, 1
    vaddpd ymm2, ymm1, ymm0

    vextractf64x2 xmm3, ymm2, 1
    vaddpd ymm4, ymm3, ymm2

    vhaddpd xmm0, xmm4, xmm4

    the SIMD instructions also take a memory operand,
    so I can do sum128 as

    code asum128b

    movsd [r13-0x8], xmm0
    lea r13, [r13-0x8]

    vmovapd zmm0, [rbx]
    vaddpd zmm0, zmm0, [rbx+64]
    vaddpd zmm0, zmm0, [rbx+128]
    vaddpd zmm0, zmm0, [rbx+192]
    vaddpd zmm0, zmm0, [rbx+256]
    vaddpd zmm0, zmm0, [rbx+320]
    vaddpd zmm0, zmm0, [rbx+384]
    vaddpd zmm0, zmm0, [rbx+448]
    vaddpd zmm0, zmm0, [rbx+512]
    vaddpd zmm0, zmm0, [rbx+576]
    vaddpd zmm0, zmm0, [rbx+640]
    vaddpd zmm0, zmm0, [rbx+704]
    vaddpd zmm0, zmm0, [rbx+768]
    vaddpd zmm0, zmm0, [rbx+832]
    vaddpd zmm0, zmm0, [rbx+896]
    vaddpd zmm0, zmm0, [rbx+960]


    ; Horizontal sum of zmm0

    vextractf64x4 ymm1, zmm0, 1
    vaddpd ymm2, ymm1, ymm0

    vextractf64x2 xmm3, ymm2, 1
    vaddpd ymm4, ymm3, ymm2

    vpermilpd xmm5, xmm4, 1
    vaddsd xmm0, xmm4, xmm5


    ret
    end-code

    this compiles to 154 bytes and 25 instructions
    The original sum128 is 2157 bytes and 513 instructions!

    Yes, the horizontal sum should just be done once.
    I have only replaced sum128 with SIMD as a test.
    Later I will do a complete example.

    This asum128b does not change the timing but reduces
    the number of instructions

    277,333,790 cycles:u
    834,846,183 instructions:u # 3.01 insn per cycle



    Instead of doing the horizontal sum once for every sum128, it might be
    more efficient (assuming the whole thing is not
    cache-bandwidth-limited) to have the result of sum128 be a full SIMD
    width, and then add them up with vaddpd instead of addsd, and do the horizontal sum once in the end.

    But if the recursive part is to be programmed in Forth, we would need
    a way to represent a SIMD width of data in Forth, maybe with a SIMD
    stack. I see a few problems there:

    * What to do about the mask registers of AVX-512? In the RISC-V
    vector extension masks are stored in regular SIMD registers.

    * There is a trend visible in ARM SVE and the RISC-V Vector extension
    to have support for dealing with loops across longer vectors. Do we
    also need to support something like that?

    For the RISC-V vector extension, see <https://riscv.org/wp-content/uploads/2024/12/15.20-15.55-18.05.06.VEXT-bcn-v1.pdf>

    One way to deal with all that would be to have a long-vector stack and
    have something like my vector wordset
    <https://github.com/AntonErtl/vectors>, where the sum of a vector
    would be a word that is implemented in some lower-level way (e.g.,
    assembly language); the sum of a vector is actually a planned, but not
    yet existing feature of this wordset.

    An advantage of having a (short) SIMD stack would be that one could
    use SIMD operations for other uses where the long-vector wordset looks
    too heavy-weight (or would need optimizations to get rid of the
    long-vector overhead). The question is whether enough such uses exist to
    justify adding such a stack.

    - anton

    I will take a look at your vector implementation and see if it can be used
    in lxf64

    BR
    Peter

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Jul 19 14:39:42 2025
    From Newsgroup: comp.lang.forth

    peter <peter.noreply@tin.it> writes:
    On Sat, 19 Jul 2025 10:18:15 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    [sum32]
    vmovapd zmm0, [rbx]
    vaddpd zmm0, zmm0, [rbx+64]
    vmovapd zmm1, [rbx+128]
    vaddpd zmm1, zmm1, [rbx+192]
    vaddpd zmm0, zmm0, zmm1
    ; and then the Horizontal sum

    ; Horizontal sum of zmm0

    vextractf64x4 ymm1, zmm0, 1
    vaddpd ymm2, ymm1, ymm0

    vextractf64x2 xmm3, ymm2, 1
    vaddpd ymm4, ymm3, ymm2

    vhaddpd xmm0, xmm4, xmm4

    the SIMD instructions also take a memory operand,
    so I can do sum128 as

    code asum128b

    movsd [r13-0x8], xmm0
    lea r13, [r13-0x8]

    vmovapd zmm0, [rbx]
    vaddpd zmm0, zmm0, [rbx+64]
    vaddpd zmm0, zmm0, [rbx+128]
    vaddpd zmm0, zmm0, [rbx+192]
    vaddpd zmm0, zmm0, [rbx+256]
    vaddpd zmm0, zmm0, [rbx+320]
    vaddpd zmm0, zmm0, [rbx+384]
    vaddpd zmm0, zmm0, [rbx+448]
    vaddpd zmm0, zmm0, [rbx+512]
    vaddpd zmm0, zmm0, [rbx+576]
    vaddpd zmm0, zmm0, [rbx+640]
    vaddpd zmm0, zmm0, [rbx+704]
    vaddpd zmm0, zmm0, [rbx+768]
    vaddpd zmm0, zmm0, [rbx+832]
    vaddpd zmm0, zmm0, [rbx+896]
    vaddpd zmm0, zmm0, [rbx+960]

    Yes, but that's not pairwise addition, so for these 16 adds you get
    worse average accuracy; if the CPU has limited OoO buffering (maybe
    one of the Xeon Phis, but not anything modern that has AVX or
    AVX-512), you may also see some of the addition latency. You still
    get pairwise addition and its accuracy benefit for the horizontal sum
    and the recursive parts.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Jul 19 14:51:00 2025
    From Newsgroup: comp.lang.forth

    minforth <minforth@gmx.net> writes:
    Am 19.07.2025 um 12:18 schrieb Anton Ertl:

    One way to deal with all that would be to have a long-vector stack and
    have something like my vector wordset
    <https://github.com/AntonErtl/vectors>, where the sum of a vector
    would be a word that is implemented in some lower-level way (e.g.,
    assembly language); the sum of a vector is actually a planned, but not
    yet existing feature of this wordset.


    Not wanting to sound negative, but who in practice adds up long
    vectors, apart from testing compilers and fp-arithmetic?

    Everyone who does dot-products.

    Dot products, on the other hand, are fundamental for many linear
    algebra algorithms, e.g. matrix multiplication and AI.

    If I add a vector-sum word

    df+red ( dfv -- r )
    \ r is the sum of the elements of dfv

    to the vector wordset, then the dot-product is:

    : dot-product ( dfv1 dfv2 -- r )
    df*v df+red ;

    Concerning matrix multiplication, while you can use the dot-product
    for it, there are many other ways to do it, and some are more
    efficient (although, admittedly, I have not used pairwise addition for
    these ways).

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From minforth@gmx.net to comp.lang.forth on Sun Jul 20 06:27:53 2025
    From Newsgroup: comp.lang.forth

    Am 19.07.2025 um 16:51 schrieb Anton Ertl:
    minforth <minforth@gmx.net> writes:
    Am 19.07.2025 um 12:18 schrieb Anton Ertl:

    One way to deal with all that would be to have a long-vector stack and
    have something like my vector wordset
    <https://github.com/AntonErtl/vectors>, where the sum of a vector
    would be a word that is implemented in some lower-level way (e.g.,
    assembly language); the sum of a vector is actually a planned, but not
    yet existing feature of this wordset.


    Not wanting to sound negative, but who in practice adds up long
    vectors, apart from testing compilers and fp-arithmetic?

    Everyone who does dot-products.

    Dot products, on the other hand, are fundamental for many linear
    algebra algorithms, e.g. matrix multiplication and AI.

    If I add a vector-sum word

    df+red ( dfv -- r )
    \ r is the sum of the elements of dfv

    to the vector wordset, then the dot-product is:

    : dot-product ( dfv1 dfv2 -- r )
    df*v df+red ;

    Sure, slow hand is not just for guitar players.
    With FMA, one could traverse the vectors in one go.

    https://docs.nvidia.com/cuda/floating-point/index.html

    Concerning matrix multiplication, while you can use the dot-product
    for it, there are many other ways to do it, and some are more
    efficient (although, admittedly, I have not used pairwise addition for
    these ways).

    There are tons of algorithms depending on various matrix properties.

    Then, given a desktop and a fat CPU, LAPACK et al. are your friends.
    An embedded or special CPU is a different story.

    --- Synchronet 3.21a-Linux NewsLink 1.2