• More real world matmul kernel Was: Matmul in VVM (was: Sane(r) SIMD)

    From Michael S@already5chosen@yahoo.com to comp.arch on Sun May 3 18:30:56 2026
    From Newsgroup: comp.arch

    On Sun, 03 May 2026 14:39:54 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    In order to convince gcc to generate AVX-512 on skylake-avx512 I had
    to go back to gcc7.

    skylake-avx512: recognized since clang-3.9, generates semi-reasonable
    avx512 code with clang-3.9 to 9.

    gcc-8.1 was released on May 2, 2018.

    clang-10.0 was released on 24 March 2020.

    Alder Lake was released in late 2021, but initially had some AVX-512
    support that was disabled via firmware later.

    Given how indecisive Intel has been on Alder Lake, it seems unlikely
    that they eliminated the AVX-512 usage already in gcc-8 so early to
    avoid disappointments when Alder Lake is released. But somehow I
    cannot come up with a different explanation.

    In this particular case it could be a blessing for Intel, because
    the code that newer clang generated for znver5 uses 512-bit SIMD but
    looks horrible. I fully expect that it is slower than the more
    conservative code for Intel targets.


    I wrote a less artificial kernel, one that is actually very similar to
    how I'd do the inner loop of a matmul of big matrices on a SIMD512
    target machine.

    https://godbolt.org/z/aYhxcTehq
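
    For readers who cannot open the link: the following is NOT the code
    behind the Godbolt URL, only a minimal sketch of what an inner-loop
    matmul kernel for a 512-bit SIMD target commonly looks like. It
    assumes float data, packed A and B panels, and a register tile of
    4 rows by 32 columns (two 512-bit vectors per row). The names
    microkernel, MR and NR and the tile shape are hypothetical choices
    made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tile shape: 4 rows x 32 floats,
   i.e. two 512-bit vectors per row of the C tile. */
enum { MR = 4, NR = 32 };

/* Computes C[MR][NR] += A_panel * B_panel.
   a: packed MR x k panel, element (i,p) stored at a[p*MR + i]
   b: packed k x NR panel, element (p,j) stored at b[p*NR + j]
   c: row-major tile with leading dimension ldc (ldc >= NR) */
void microkernel(size_t k, const float *restrict a, const float *restrict b,
                 float *restrict c, size_t ldc)
{
    assert(ldc >= NR);

    /* Accumulators that a vectorizing compiler can keep in registers. */
    float acc[MR][NR] = {{0}};

    for (size_t p = 0; p < k; p++)
        for (int i = 0; i < MR; i++)          /* broadcast a[p*MR + i] */
            for (int j = 0; j < NR; j++)      /* fixed trip count: vectorizable */
                acc[i][j] += a[p*MR + i] * b[p*NR + j];

    /* Add the accumulated tile into C once, after the k loop. */
    for (int i = 0; i < MR; i++)
        for (int j = 0; j < NR; j++)
            c[i*ldc + j] += acc[i][j];
}
```

    The fixed-trip-count inner loops are what give an auto-vectorizer a
    chance to map each row of the tile onto whole vector registers and
    each A element onto a broadcast; whether a given compiler and
    -march setting actually does so has to be checked in the generated
    assembly.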

    I am very impressed by the quality of the code that gcc generated for
    Zen4 and Zen5. Exactly the same code would be an excellent fit for any
    Intel AVX512 target, and especially for Intel processors with dual
    512b pipes.
    But gcc does not generate this code for any Intel target. That's
    extremely weird.



    The whole auto-vectorization stuff tends to be pretty unreliable
    overall. Sometimes the code that is generated looks good, sometimes
    it is a mess, sometimes there is no auto-vectorization at all.

    - anton

    Very true.
