• primes.pl mainly tests the Prolog ALU [mod/2 vs rem/2]

    From Mild Shock@janburse@fastmail.fm to sci.math on Wed Oct 15 04:34:57 2025
    From Newsgroup: sci.math

    Hi,

I spent some time thinking about my primes.pl
test, and came to the conclusion that it
mainly tests the Prolog ALU: things like
integer successor or integer modulo. Then
I found that Java has Math.floorMod(), which
I wasn't using yet. And bang, the results are better:

    /* Dogelog Player 2.1.2 for Java, today */
    ?- time(test).
    % Zeit 286 ms, GC 1 ms, Lips 26302430, Uhr 15.10.2025 02:31
    true.

Maybe the Java backend picks a CPU instruction
for Math.floorMod() instead of executing the
longer code sequence that is needed to correct
rem/2 into mod/2. Who knows. I also reorganized
the code a little bit and eliminated an extra
method call in all arithmetic functions, by
inlining the arithmetic function body into the
evaluable predicate definition code. Comparison
to old measurements, and some measurements of
other Prolog systems:

    /* Dogelog Player 2.1.2 for Java, weeks ago */
    ?- time(test).
    % Zeit 378 ms, GC 1 ms, Lips 19900780, Uhr 28.08.2025 17:44
    true.

    /* SWI-Prolog 9.0.4 */
    ?- time(test).
    % 7,506,639 inferences, 0.363 CPU in 0.362 seconds
    (100% CPU, 20693560 Lips)
    true.

    /* Scryer Prolog 0.9.4-639 */
    ?- time(test).
    % CPU time: 0.365s, 7_517_613 inferences
    true.

    /* Trealla Prolog 2.82.23-3 */
    ?- time(test).
    % Time elapsed 0.868s, 11263917 Inferences, 12.983 MLips
    true.

    Bye

    P.S.: The code uses the hated mathematical mod/2,
    and not the cheaper rem/2 that CPUs usually have:

test :-
    len(L, 1000),
    primes(L, _).

primes([], 1).
primes([J|L], J) :-
    primes(L, I),
    K is I+1,
    search(L, K, J).

search(L, I, J) :-
    mem(X, L),
    I mod X =:= 0, !,
    K is I+1,
    search(L, K, J).
search(_, I, I).

mem(X, [X|_]).
mem(X, [_|Y]) :-
    mem(X, Y).

len([], 0) :- !.
len([_|L], N) :-
    N > 0,
    M is N-1,
    len(L, M).
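For what it's worth, the difference between the two, and the
correction that mod/2 needs on top of the remainder the CPU
delivers, can be illustrated in Java. A minimal sketch; the
correction sequence shown is just one possible way to do it,
not necessarily what the Dogelog backend emits:

```java
public class ModVsRem {
    public static void main(String[] args) {
        // Truncated remainder (%): sign follows the dividend, rem/2 semantics.
        System.out.println(-7 % 3);               // -1
        // Floored modulo: sign follows the divisor, mod/2 semantics.
        System.out.println(Math.floorMod(-7, 3)); // 2
        // One possible correction sequence from rem/2 to mod/2:
        int r = -7 % 3;
        if (r != 0 && (r ^ 3) < 0) r += 3;        // signs differ: adjust
        System.out.println(r);                    // 2 again
    }
}
```

For nonnegative operands the two agree, which is why the prime
sieve above gives the same answers either way; only the code
sequence the backend has to emit differs.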
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Mild Shock@janburse@fastmail.fm to sci.math on Wed Oct 15 04:37:23 2025
    From Newsgroup: sci.math

    Hi,

The change from 378 ms to 286 ms, around 25%,
is insane. But I did both tests on a novel AI CPU,
to be precise an AMD Ryzen AI 7 350.

But somehow I picked up rumors that AI CPUs
might now do Neural Network Branch Prediction. The
idea seems to exist in hardware at least since 2012:

Machine learning and artificial intelligence are
the current hype (again). In their new Ryzen
processors, AMD advertises the Neural Net
Prediction. It turns out this was already
used in their older (2012) Piledriver architecture,
used for example in the AMD A10-4600M. It is also
present in recent Samsung processors such as the
one powering the Galaxy S7. What is it really?
https://chasethedevil.github.io/post/the_neural_network_in_your_cpu/
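AMD's Neural Net Prediction is commonly reported to be a
perceptron-style predictor in the tradition of Jimenez and Lin.
A minimal sketch of the idea, assuming a single perceptron over
a global history register (real hardware keeps a whole table of
perceptrons indexed by branch address, with saturating weights):

```java
/** Perceptron branch predictor sketch (Jimenez & Lin style).
    A dot product of weights and recent branch outcomes gives the
    prediction; weights are nudged on mispredicts or weak outputs.
    Illustrative only, not how any shipping core is actually wired. */
public class PerceptronPredictor {
    static final int HIST = 8;          // global history length
    static final int THETA = 2 * HIST;  // training threshold (paper: ~1.93*h+14)
    int[] weights = new int[HIST + 1];  // weights[0] is the bias
    int[] history = new int[HIST];      // +1 = taken, -1 = not taken

    boolean predict() {
        return output() >= 0;
    }

    int output() {
        int y = weights[0];
        for (int i = 0; i < HIST; i++) y += weights[i + 1] * history[i];
        return y;
    }

    void train(boolean taken) {
        int t = taken ? 1 : -1;
        int y = output();
        // train on a mispredict, or while the output is still weak
        if ((y >= 0) != taken || Math.abs(y) <= THETA) {
            weights[0] += t;
            for (int i = 0; i < HIST; i++) weights[i + 1] += t * history[i];
        }
        // shift the actual outcome into the history register
        System.arraycopy(history, 0, history, 1, HIST - 1);
        history[0] = t;
    }
}
```

After a few iterations on a regular branch pattern the weights
lock onto the correlated history bits, which is the whole trick:
linear, cheap, and trainable in hardware.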

It can also be done with Convolutional Neural Networks (CNNs):

BranchNet: A Convolutional Neural Network to
Predict Hard-To-Predict Branches
To this end, Tarsa et al. proposed using convolutional
neural networks (CNNs) that are trained at
compile time to accurately predict branches that
TAGE cannot. Given enough profiling coverage, CNNs
learn input-independent branch correlations.
https://microarch.org/micro53/papers/738300a118.pdf

Interestingly, the above showcases PGO-based
machine learning for branch predictors. No clue
how they construct the CPU so that it can be fed
with offline-constructed neural networks for
its own execution. Maybe an optimizer uses it?
But I guess a more modern solution would not only
use CNNs, but also an Attention Mechanism.

    Bye

  • From Mild Shock@janburse@fastmail.fm to sci.math on Wed Oct 15 16:06:27 2025
    From Newsgroup: sci.math

    Hi,

It seems I am having problems keeping pace with
all the new fancy toys. I wasn't able to really
benchmark the NPU of my Desktop AI machine,
since I picked the wrong driver. Need to try again.
What worked was benchmarking Mobile AI machines.
I just grabbed Geekbench AI and some devices:

    USA Fab, M4:

                sANN     hANN     qANN
    iPad CPU    4848     7947     6353
    iPad GPU    9752    11383    10051
    iPad NPU    4873    36544  *51634*

    China Fab, Snapdragon:

                 sANN     hANN     qANN
    Redmi CPU    1044      950     1723
    Redmi GPU     480      905      737
    Redmi NNAPI   205      205      469
    Redmi QNN     226      226  *10221*

The speed-up via the NPU is a factor of 10x. See the
column qANN, which means quantized artificial
neural networks, when the NPU or QNN is picked.

The mobile AI NPUs are optimized to use
minimal amounts of energy and minimal amounts
of space, squeezing (distilling) everything
into INT8 and INT4.
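The squeezing itself is straightforward. A minimal sketch of
symmetric per-tensor INT8 quantization, the kind of compression
behind those qANN scores; real NPU toolchains also calibrate on
sample data, clip outliers, and use per-channel scales:

```java
/** Symmetric INT8 quantization sketch: map floats onto an integer
    grid so weights fit in one byte each. Illustrative only. */
public class Int8Quant {
    /** Pick a scale so the largest magnitude maps to +/-127. */
    static float scaleFor(float[] x) {
        float m = 0;
        for (float v : x) m = Math.max(m, Math.abs(v));
        return m / 127f;
    }

    static byte[] quantize(float[] x, float scale) {
        byte[] q = new byte[x.length];
        for (int i = 0; i < x.length; i++) {
            int v = Math.round(x[i] / scale);               // snap to grid
            q[i] = (byte) Math.max(-127, Math.min(127, v)); // clamp to INT8
        }
        return q;
    }

    /** Recover approximate floats; the error is at most scale/2 per value. */
    static float[] dequantize(byte[] q, float scale) {
        float[] x = new float[q.length];
        for (int i = 0; i < q.length; i++) x[i] = q[i] * scale;
        return x;
    }
}
```

A 70B-parameter model at FP32 is 280 GB; at INT4 it is 35 GB,
which is what makes local inference on these devices thinkable
at all.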

    Bye

    Mild Shock schrieb:
    Hi,

    The change from 378 ms to 286 ms is around 25-30%
    is insane. But I did both tests on a novel AI CPU.
    To be precise on a AMD Ryzen AI 7 350.

    But somehow I picked up rumors that AI CPUs now
    might do Neural Network Branch Prediction. The
    idea seems to exist in hardware at least since (2012):

    Machine learning and artificial intelligence are
    the current hype (again). In their new Ryzen
    processors, AMD advertises the Neural Net
    Prediction. It turns out this is was already
    used in their older (2012) Piledriver architecture
    used for example in the AMD A10-4600M. It is also
    present in recent Samsung processors such as the
    one powering the Galaxy S7. What is it really? https://chasethedevil.github.io/post/the_neural_network_in_your_cpu/

    It can be done with Convoluted Neural Networks (CNN):

    BranchNet: A Convolutional Neural Network to
    Predict Hard-To-Predict Branches
    To this end, Tarsa et al. proposed using convolutional
    neural networks (CNNs) that are trained at
    compiletime to accurately predict branches that
    TAGE cannot. Given enough profiling coverage, CNNs
    learn input-independent branch correlations. https://microarch.org/micro53/papers/738300a118.pdf

    Interstingly the above shows cases a PGO based
    Machine Learning for Branch Predictors. No clue
    how they construct the CPU, that they can feed

    it with offline constructed neural neutworks for
    their own execution. Maybe an optimizer uses it?
    But I guess a more modern-a solutions would not only

    use CNN, but also an Attention Mechanism.

    Bye

    Mild Shock schrieb:
    Hi,

    I spent some time thinking about my primes.pl
    test. And came to the conclusion that it
    mainly tests the Prolog ALU. Things like

    integer successor or integer modulo. Then
    I found that Java has Math.floorMod() which
    I wasn't using yet. And peng results are better:

    /* Dogelog Player 2.1.2 for Java, today */
    ?- time(test).
    % Zeit 286 ms, GC 1 ms, Lips 26302430, Uhr 15.10.2025 02:31
    true.

    Maybe the Java backend picks a CPU instruction
    for Math.floorMod() instead of executing the
    longer code sequence that is needed to correct

    rem/2 into mod/2. Who knows. I also reorganized
    the code a little bit, and eliminated an extra
    method call in all arithmetic functions, by

    inlining the arithmetic function body in the
    evaluable predicate definition code. Comparison
    to old measurements and some measurements of

    other Prolog systems:

    /* Dogelog Player 2.1.2 for Java, weeks ago */
    ?- time(test).
    % Zeit 378 ms, GC 1 ms, Lips 19900780, Uhr 28.08.2025 17:44
    true.

    /* SWI-Prolog 9.0.4 */
    ?- time(test).
    % 7,506,639 inferences, 0.363 CPU in 0.362 seconds
    (100% CPU, 20693560 Lips)
    true.

    /* Scryer Prolog 0.9.4-639 */
    ?- time(test).
    % CPU time: 0.365s, 7_517_613 inferences
    true.

    /* Trealla Prolog 2.82.23-3 */
    ?- time(test).
    % Time elapsed 0.868s, 11263917 Inferences, 12.983 MLips
    true.

    Bye

    P.S.: The code uses the hated mathematical mod/2,
    and not the cheaper rem/2 that CPUs usually have:

    test :-
    -a-a-a len(L, 1000),
    -a-a-a primes(L, _).

    primes([], 1).
    primes([J|L], J) :-
    -a-a-a primes(L, I),
    -a-a-a K is I+1,
    -a-a-a search(L, K, J).

    search(L, I, J) :-
    -a-a-a mem(X, L),
    -a-a-a I mod X =:= 0, !,
    -a-a-a K is I+1,
    -a-a-a search(L, K, J).
    search(_, I, I).

    mem(X, [X|_]).
    mem(X, [_|Y]) :-
    -a-a-a mem(X, Y).

    len([], 0) :- !.
    len([_|L], N) :-
    -a-a-a N > 0,
    -a-a-a M is N-1,
    -a-a-a len(L, M).


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Mild Shock@janburse@fastmail.fm to sci.math on Wed Oct 15 16:13:10 2025
    From Newsgroup: sci.math

    Hi,

But not only Mobile AI and Desktop AI are making
a broader imprint now. We might also experience
Workstation AI, with a 3'000.- USD price tag:

    You Can't Buy This... Yet! The NVIDIA GB10 from Dell
    The New Superchip that Terrifies the Cloud! https://www.youtube.com/watch?v=x1qViw4xyVo

So what's going on? I asked Phind, which is
driven by a 70B model tailored towards developers:

Q: Is there an AI inflection point right now,
with NPUs in mobile, desktop and workstation?

    A: Evidence of the Inflection Point

    - Mobile Leadership
    NPUs originated in smartphones
    Now becoming ubiquitous across all device types
    Enabling sophisticated AI features at consumer price points

    - Desktop Revolution
    Major manufacturers implementing NPUs across product lines
    Apple's Neural Engine integrated into M-series chips
    Qualcomm, Intel, and AMD incorporating AI accelerators

    - Workstation Transformation
    Professional-grade NPUs in mobile workstations
    Demonstrated superior performance for AI-specific tasks
    Enabling local processing of previously cloud-dependent workloads

    https://www.phind.com/search/cmgs1s6jv00023h67g5z2aaa0

    Bye
