• primes.pl mainly tests the Prolog ALU [mod/2 vs rem/2]

    From Mild Shock@janburse@fastmail.fm to sci.math on Wed Oct 15 04:34:57 2025
    From Newsgroup: sci.math

    Hi,

I spent some time thinking about my primes.pl
test, and came to the conclusion that it
mainly tests the Prolog ALU: things like
integer successor or integer modulo. Then
I found that Java has Math.floorMod(), which
I wasn't using yet. And bang, the results are better:

    /* Dogelog Player 2.1.2 for Java, today */
    ?- time(test).
    % Zeit 286 ms, GC 1 ms, Lips 26302430, Uhr 15.10.2025 02:31
    true.

Maybe the Java backend picks a CPU instruction
for Math.floorMod() instead of executing the
longer code sequence that is needed to correct
rem/2 into mod/2. Who knows. I also reorganized
the code a little bit and eliminated an extra
method call in all arithmetic functions, by
inlining the arithmetic function body into the
evaluable predicate definition code. Comparison
to old measurements, and some measurements of
other Prolog systems:

    /* Dogelog Player 2.1.2 for Java, weeks ago */
    ?- time(test).
    % Zeit 378 ms, GC 1 ms, Lips 19900780, Uhr 28.08.2025 17:44
    true.

    /* SWI-Prolog 9.0.4 */
    ?- time(test).
    % 7,506,639 inferences, 0.363 CPU in 0.362 seconds
    (100% CPU, 20693560 Lips)
    true.

    /* Scryer Prolog 0.9.4-639 */
    ?- time(test).
    % CPU time: 0.365s, 7_517_613 inferences
    true.

    /* Trealla Prolog 2.82.23-3 */
    ?- time(test).
    % Time elapsed 0.868s, 11263917 Inferences, 12.983 MLips
    true.

    Bye

    P.S.: The code uses the hated mathematical mod/2,
    and not the cheaper rem/2 that CPUs usually have:

test :-
    len(L, 1000),
    primes(L, _).

primes([], 1).
primes([J|L], J) :-
    primes(L, I),
    K is I+1,
    search(L, K, J).

search(L, I, J) :-
    mem(X, L),
    I mod X =:= 0, !,
    K is I+1,
    search(L, K, J).
search(_, I, I).

mem(X, [X|_]).
mem(X, [_|Y]) :-
    mem(X, Y).

len([], 0) :- !.
len([_|L], N) :-
    N > 0,
    M is N-1,
    len(L, M).
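For what it's worth, the difference between the two, and the
correction that mod/2 needs on top of the remainder the CPU
delivers, can be illustrated in Java. A minimal sketch; the
correction sequence shown is just one possible way to do it,
not necessarily what the Dogelog backend emits:

```java
public class ModVsRem {
    public static void main(String[] args) {
        // Truncated remainder (%): sign follows the dividend, rem/2 semantics.
        System.out.println(-7 % 3);               // -1
        // Floored modulo: sign follows the divisor, mod/2 semantics.
        System.out.println(Math.floorMod(-7, 3)); // 2
        // One possible correction sequence from rem/2 to mod/2:
        int r = -7 % 3;
        if (r != 0 && (r ^ 3) < 0) r += 3;        // signs differ: adjust
        System.out.println(r);                    // 2 again
    }
}
```

For nonnegative operands the two agree, which is why the prime
sieve above gives the same answers either way; only the code
sequence the backend has to emit differs.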
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Mild Shock@janburse@fastmail.fm to sci.math on Wed Oct 15 04:37:23 2025
    From Newsgroup: sci.math

    Hi,

The change from 378 ms to 286 ms, around 25%,
is insane. But I did both tests on a novel AI CPU,
to be precise an AMD Ryzen AI 7 350.

But somehow I picked up rumors that AI CPUs
might now do Neural Network Branch Prediction. The
idea seems to exist in hardware at least since 2012:

Machine learning and artificial intelligence are
the current hype (again). In their new Ryzen
processors, AMD advertises the Neural Net
Prediction. It turns out this was already
used in their older (2012) Piledriver architecture,
used for example in the AMD A10-4600M. It is also
present in recent Samsung processors such as the
one powering the Galaxy S7. What is it really?
https://chasethedevil.github.io/post/the_neural_network_in_your_cpu/
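AMD's Neural Net Prediction is commonly reported to be a
perceptron-style predictor in the tradition of Jimenez and Lin.
A minimal sketch of the idea, assuming a single perceptron over
a global history register (real hardware keeps a whole table of
perceptrons indexed by branch address, with saturating weights):

```java
/** Perceptron branch predictor sketch (Jimenez & Lin style).
    A dot product of weights and recent branch outcomes gives the
    prediction; weights are nudged on mispredicts or weak outputs.
    Illustrative only, not how any shipping core is actually wired. */
public class PerceptronPredictor {
    static final int HIST = 8;          // global history length
    static final int THETA = 2 * HIST;  // training threshold (paper: ~1.93*h+14)
    int[] weights = new int[HIST + 1];  // weights[0] is the bias
    int[] history = new int[HIST];      // +1 = taken, -1 = not taken

    boolean predict() {
        return output() >= 0;
    }

    int output() {
        int y = weights[0];
        for (int i = 0; i < HIST; i++) y += weights[i + 1] * history[i];
        return y;
    }

    void train(boolean taken) {
        int t = taken ? 1 : -1;
        int y = output();
        // train on a mispredict, or while the output is still weak
        if ((y >= 0) != taken || Math.abs(y) <= THETA) {
            weights[0] += t;
            for (int i = 0; i < HIST; i++) weights[i + 1] += t * history[i];
        }
        // shift the actual outcome into the history register
        System.arraycopy(history, 0, history, 1, HIST - 1);
        history[0] = t;
    }
}
```

After a few iterations on a regular branch pattern the weights
lock onto the correlated history bits, which is the whole trick:
linear, cheap, and trainable in hardware.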

It can also be done with Convolutional Neural Networks (CNNs):

BranchNet: A Convolutional Neural Network to
Predict Hard-To-Predict Branches
To this end, Tarsa et al. proposed using convolutional
neural networks (CNNs) that are trained at
compile time to accurately predict branches that
TAGE cannot. Given enough profiling coverage, CNNs
learn input-independent branch correlations.
https://microarch.org/micro53/papers/738300a118.pdf

Interestingly, the above showcases PGO-based
machine learning for branch predictors. No clue
how they construct the CPU so that it can be fed
with offline-constructed neural networks for
its own execution. Maybe an optimizer uses it?
But I guess a more modern solution would not only
use CNNs, but also an Attention Mechanism.

    Bye

  • From Mild Shock@janburse@fastmail.fm to sci.math on Wed Oct 15 16:06:27 2025
    From Newsgroup: sci.math

    Hi,

It seems I am having problems keeping pace with
all the new fancy toys. I wasn't able to really
benchmark the NPU of my Desktop AI machine,
since I picked the wrong driver. Need to try again.
What worked was benchmarking Mobile AI machines.
I just grabbed Geekbench AI and some devices:

    USA Fab, M4:

                sANN     hANN     qANN
    iPad CPU    4848     7947     6353
    iPad GPU    9752    11383    10051
    iPad NPU    4873    36544  *51634*

    China Fab, Snapdragon:

                 sANN     hANN     qANN
    Redmi CPU    1044      950     1723
    Redmi GPU     480      905      737
    Redmi NNAPI   205      205      469
    Redmi QNN     226      226  *10221*

The speed-up via the NPU is a factor of 10x. See the
column qANN, which means quantized artificial
neural networks, when the NPU or QNN is picked.

The mobile AI NPUs are optimized to use
minimal amounts of energy and minimal amounts
of space, squeezing (distilling) everything
into INT8 and INT4.
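The squeezing itself is straightforward. A minimal sketch of
symmetric per-tensor INT8 quantization, the kind of compression
behind those qANN scores; real NPU toolchains also calibrate on
sample data, clip outliers, and use per-channel scales:

```java
/** Symmetric INT8 quantization sketch: map floats onto an integer
    grid so weights fit in one byte each. Illustrative only. */
public class Int8Quant {
    /** Pick a scale so the largest magnitude maps to +/-127. */
    static float scaleFor(float[] x) {
        float m = 0;
        for (float v : x) m = Math.max(m, Math.abs(v));
        return m / 127f;
    }

    static byte[] quantize(float[] x, float scale) {
        byte[] q = new byte[x.length];
        for (int i = 0; i < x.length; i++) {
            int v = Math.round(x[i] / scale);               // snap to grid
            q[i] = (byte) Math.max(-127, Math.min(127, v)); // clamp to INT8
        }
        return q;
    }

    /** Recover approximate floats; the error is at most scale/2 per value. */
    static float[] dequantize(byte[] q, float scale) {
        float[] x = new float[q.length];
        for (int i = 0; i < q.length; i++) x[i] = q[i] * scale;
        return x;
    }
}
```

A 70B-parameter model at FP32 is 280 GB; at INT4 it is 35 GB,
which is what makes local inference on these devices thinkable
at all.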

    Bye

    Mild Shock schrieb:
    Hi,

    The change from 378 ms to 286 ms is around 25-30%
    is insane. But I did both tests on a novel AI CPU.
    To be precise on a AMD Ryzen AI 7 350.

    But somehow I picked up rumors that AI CPUs now
    might do Neural Network Branch Prediction. The
    idea seems to exist in hardware at least since (2012):

    Machine learning and artificial intelligence are
    the current hype (again). In their new Ryzen
    processors, AMD advertises the Neural Net
    Prediction. It turns out this is was already
    used in their older (2012) Piledriver architecture
    used for example in the AMD A10-4600M. It is also
    present in recent Samsung processors such as the
    one powering the Galaxy S7. What is it really? https://chasethedevil.github.io/post/the_neural_network_in_your_cpu/

    It can be done with Convoluted Neural Networks (CNN):

    BranchNet: A Convolutional Neural Network to
    Predict Hard-To-Predict Branches
    To this end, Tarsa et al. proposed using convolutional
    neural networks (CNNs) that are trained at
    compiletime to accurately predict branches that
    TAGE cannot. Given enough profiling coverage, CNNs
    learn input-independent branch correlations. https://microarch.org/micro53/papers/738300a118.pdf

    Interstingly the above shows cases a PGO based
    Machine Learning for Branch Predictors. No clue
    how they construct the CPU, that they can feed

    it with offline constructed neural neutworks for
    their own execution. Maybe an optimizer uses it?
    But I guess a more modern-a solutions would not only

    use CNN, but also an Attention Mechanism.

    Bye

    Mild Shock schrieb:
    Hi,

    I spent some time thinking about my primes.pl
    test. And came to the conclusion that it
    mainly tests the Prolog ALU. Things like

    integer successor or integer modulo. Then
    I found that Java has Math.floorMod() which
    I wasn't using yet. And peng results are better:

    /* Dogelog Player 2.1.2 for Java, today */
    ?- time(test).
    % Zeit 286 ms, GC 1 ms, Lips 26302430, Uhr 15.10.2025 02:31
    true.

    Maybe the Java backend picks a CPU instruction
    for Math.floorMod() instead of executing the
    longer code sequence that is needed to correct

    rem/2 into mod/2. Who knows. I also reorganized
    the code a little bit, and eliminated an extra
    method call in all arithmetic functions, by

    inlining the arithmetic function body in the
    evaluable predicate definition code. Comparison
    to old measurements and some measurements of

    other Prolog systems:

    /* Dogelog Player 2.1.2 for Java, weeks ago */
    ?- time(test).
    % Zeit 378 ms, GC 1 ms, Lips 19900780, Uhr 28.08.2025 17:44
    true.

    /* SWI-Prolog 9.0.4 */
    ?- time(test).
    % 7,506,639 inferences, 0.363 CPU in 0.362 seconds
    (100% CPU, 20693560 Lips)
    true.

    /* Scryer Prolog 0.9.4-639 */
    ?- time(test).
    % CPU time: 0.365s, 7_517_613 inferences
    true.

    /* Trealla Prolog 2.82.23-3 */
    ?- time(test).
    % Time elapsed 0.868s, 11263917 Inferences, 12.983 MLips
    true.

    Bye

    P.S.: The code uses the hated mathematical mod/2,
    and not the cheaper rem/2 that CPUs usually have:

    test :-
    -a-a-a len(L, 1000),
    -a-a-a primes(L, _).

    primes([], 1).
    primes([J|L], J) :-
    -a-a-a primes(L, I),
    -a-a-a K is I+1,
    -a-a-a search(L, K, J).

    search(L, I, J) :-
    -a-a-a mem(X, L),
    -a-a-a I mod X =:= 0, !,
    -a-a-a K is I+1,
    -a-a-a search(L, K, J).
    search(_, I, I).

    mem(X, [X|_]).
    mem(X, [_|Y]) :-
    -a-a-a mem(X, Y).

    len([], 0) :- !.
    len([_|L], N) :-
    -a-a-a N > 0,
    -a-a-a M is N-1,
    -a-a-a len(L, M).


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Mild Shock@janburse@fastmail.fm to sci.math on Wed Oct 15 16:13:10 2025
    From Newsgroup: sci.math

    Hi,

But not only Mobile AI and Desktop AI are making
a broader imprint now. We might also experience
Workstation AI, with a 3'000.- USD price tag:

    You Can't Buy This... Yet! The NVIDIA GB10 from Dell
    The New Superchip that Terrifies the Cloud! https://www.youtube.com/watch?v=x1qViw4xyVo

So what's going on? I asked Phind, which is
driven by a 70B model tailored towards developers:

Q: Is there an AI inflection point right now,
with NPUs in mobile, desktop and workstation?

    A: Evidence of the Inflection Point

    - Mobile Leadership
    NPUs originated in smartphones
    Now becoming ubiquitous across all device types
    Enabling sophisticated AI features at consumer price points

    - Desktop Revolution
    Major manufacturers implementing NPUs across product lines
    Apple's Neural Engine integrated into M-series chips
    Qualcomm, Intel, and AMD incorporating AI accelerators

    - Workstation Transformation
    Professional-grade NPUs in mobile workstations
    Demonstrated superior performance for AI-specific tasks
    Enabling local processing of previously cloud-dependent workloads

    https://www.phind.com/search/cmgs1s6jv00023h67g5z2aaa0

    Bye
