Hi,

I spent some time thinking about my primes.pl
test, and came to the conclusion that it
mainly tests the Prolog ALU: things like
integer successor or integer modulo. Then
I found that Java has Math.floorMod(), which
I wasn't using yet. And bang, the results are better:
/* Dogelog Player 2.1.2 for Java, today */
?- time(test).
% Zeit 286 ms, GC 1 ms, Lips 26302430, Uhr 15.10.2025 02:31
true.
Maybe the Java backend picks a CPU instruction
for Math.floorMod() instead of executing the
longer code sequence that is needed to correct
rem/2 into mod/2. Who knows. I also reorganized
the code a little and eliminated an extra
method call in all arithmetic functions, by
inlining the arithmetic function body into the
evaluable predicate definition code. Comparison
to old measurements and some measurements of
other Prolog systems:
/* Dogelog Player 2.1.2 for Java, weeks ago */
?- time(test).
% Zeit 378 ms, GC 1 ms, Lips 19900780, Uhr 28.08.2025 17:44
true.
/* SWI-Prolog 9.0.4 */
?- time(test).
% 7,506,639 inferences, 0.363 CPU in 0.362 seconds
(100% CPU, 20693560 Lips)
true.
/* Scryer Prolog 0.9.4-639 */
?- time(test).
% CPU time: 0.365s, 7_517_613 inferences
true.
/* Trealla Prolog 2.82.23-3 */
?- time(test).
% Time elapsed 0.868s, 11263917 Inferences, 12.983 MLips
true.
Bye
P.S.: The code uses the hated mathematical mod/2,
and not the cheaper rem/2 that CPUs usually have:
test :-
   len(L, 1000),
   primes(L, _).

primes([], 1).
primes([J|L], J) :-
   primes(L, I),
   K is I+1,
   search(L, K, J).

search(L, I, J) :-
   mem(X, L),
   I mod X =:= 0, !,
   K is I+1,
   search(L, K, J).
search(_, I, I).

mem(X, [X|_]).
mem(X, [_|Y]) :-
   mem(X, Y).

len([], 0) :- !.
len([_|L], N) :-
   N > 0,
   M is N-1,
   len(L, M).
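For illustration, here is a minimal Java sketch (the helper name modViaRem is made up) of the correction sequence that Math.floorMod() can replace. For negative operands, rem and mathematical mod differ:

```java
public class ModDemo {
    // Correcting rem (Java %) into mathematical mod: the longer
    // code sequence that Math.floorMod() can replace.
    static long modViaRem(long x, long y) {
        long r = x % y;                 // rem, what CPUs usually have
        if (r != 0 && (r ^ y) < 0) {    // remainder and divisor differ in sign
            r += y;                     // shift into the divisor's sign
        }
        return r;
    }

    public static void main(String[] args) {
        System.out.println(-7 % 3);               // rem: -1
        System.out.println(modViaRem(-7, 3));     // mod: 2
        System.out.println(Math.floorMod(-7, 3)); // mod: 2
    }
}
```

On positive operands, as in the primes test above, all three agree, but the branch still has to be executed unless the JIT can map the whole thing onto a single instruction sequence.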
Hi,
The change from 378 ms to 286 ms, around 25-30%,
is insane. But I did both tests on a novel AI CPU,
to be precise on an AMD Ryzen AI 7 350.
Somehow I picked up rumors that AI CPUs now
might do neural network branch prediction. The
idea seems to exist in hardware at least since 2012:
Machine learning and artificial intelligence are
the current hype (again). In their new Ryzen
processors, AMD advertises the Neural Net
Prediction. It turns out this was already
used in their older (2012) Piledriver architecture,
used for example in the AMD A10-4600M. It is also
present in recent Samsung processors such as the
one powering the Galaxy S7. What is it really?
https://chasethedevil.github.io/post/the_neural_network_in_your_cpu/
It can be done with Convolutional Neural Networks (CNNs):

BranchNet: A Convolutional Neural Network to
Predict Hard-To-Predict Branches

To this end, Tarsa et al. proposed using convolutional
neural networks (CNNs) that are trained at
compile time to accurately predict branches that
TAGE cannot. Given enough profiling coverage, CNNs
learn input-independent branch correlations.
https://microarch.org/micro53/papers/738300a118.pdf
Interestingly, the above showcases PGO-based
machine learning for branch predictors. No clue
how they construct the CPU so that they can feed
it with offline-constructed neural networks for
its own execution. Maybe an optimizer uses it?
But I guess a more modern solution would not only
use CNNs, but also an attention mechanism.
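AMD's actual design is not public, but the classic perceptron predictor of Jiménez and Lin gives the flavor of what "neural net prediction" in a CPU means: one weight per global-history bit, trained online. A minimal Java sketch (history length and training threshold are arbitrary illustration values):

```java
public class PerceptronPredictor {
    // Perceptron branch predictor: dot product of weights with
    // the recent branch history (+1 taken, -1 not taken).
    static final int H = 8;       // global history length
    static final int THETA = 16;  // training threshold
    int[] w = new int[H + 1];     // w[0] is the bias weight
    int[] hist = new int[H];      // most recent outcome in hist[0]

    int output() {
        int y = w[0];
        for (int i = 0; i < H; i++) y += w[i + 1] * hist[i];
        return y;
    }

    boolean predict() {
        return output() >= 0;     // predict taken if dot product >= 0
    }

    void train(boolean taken) {
        int t = taken ? 1 : -1;
        int y = output();
        // update weights only on misprediction or low confidence
        if ((y >= 0) != taken || Math.abs(y) < THETA) {
            w[0] += t;
            for (int i = 0; i < H; i++) w[i + 1] += t * hist[i];
        }
        // shift the new outcome into the history register
        for (int i = H - 1; i > 0; i--) hist[i] = hist[i - 1];
        hist[0] = t;
    }
}
```

On a strictly alternating branch, this converges after a handful of updates, which a plain 2-bit counter never manages.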
Bye
You Can't Buy This... Yet! The NVIDIA GB10 from Dell
The New Superchip that Terrifies the Cloud! https://www.youtube.com/watch?v=x1qViw4xyVo
Hi,

It seems I am having problems keeping pace with
all the new fancy toys. I wasn't able to really
benchmark my NPU on a desktop AI machine;
I picked the wrong driver. Need to try again.
What worked was benchmarking mobile AI machines.
I just grabbed Geekbench AI and some devices:
USA Fab, M4:
              sANN     hANN     qANN
iPad CPU      4848     7947     6353
iPad GPU      9752     11383    10051
iPad NPU      4873     36544    *51634*

China Fab, Snapdragon:
              sANN     hANN     qANN
Redmi CPU     1044     950      1723
Redmi GPU     480      905      737
Redmi NNAPI   205      205      469
Redmi QNN     226      226      *10221*
The speed-up via NPU is a factor of 10x. See the
column qANN, which means quantized artificial neural
networks, when the NPU or QNN backend is picked.
The mobile AI NPUs are optimized to use
minimal amounts of energy and minimal amounts
of space, squeezing (distilling) everything
into INT8 and INT4.
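As a rough illustration of what "quantized" in the qANN column means, here is a minimal Java sketch of symmetric per-tensor INT8 quantization. This is only the basic idea, not what Geekbench AI or the NPU drivers actually do:

```java
public class Int8Quant {
    // Symmetric per-tensor INT8 quantization: map floats in
    // [-max|w|, +max|w|] onto the integer range [-127, 127].
    static float scaleFor(float[] w) {
        float max = 0f;
        for (float v : w) max = Math.max(max, Math.abs(v));
        return max / 127f;
    }

    static byte[] quantize(float[] w, float scale) {
        byte[] q = new byte[w.length];
        for (int i = 0; i < w.length; i++)
            q[i] = (byte) Math.round(w[i] / scale);
        return q;
    }

    public static void main(String[] args) {
        float[] w = { -1.5f, 0.02f, 0.7f, 1.5f };
        float s = scaleFor(w);           // 1.5 / 127
        byte[] q = quantize(w, s);       // { -127, 2, 59, 127 }
        // dequantize to see the rounding error
        for (int i = 0; i < q.length; i++)
            System.out.println(w[i] + " -> " + q[i] * s);
    }
}
```

The NPU then runs the whole network on such 8-bit integers, which is where the energy and space savings come from; small values near zero lose the most precision.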
Bye