• Re: Efficient implementation of Finite State Machines (FSM)

    From Anton Ertl@21:1/5 to Gerry Jackson on Sun Mar 9 16:29:23 2025
    Gerry Jackson <do-not-use@swldwa.uk> writes:
    Use of a wordlist, whose wid is held in an immediate constant, enables
    easy linkage between states at compile time, eg a typical state action
    in outline is:

    :noname <action>
    if state-x [ >order ] next-state [ previous ]
    else state-y [ >order ] next-state [ previous ]
    ; this-state >order to next-state previous

    It means that Gforth will have to improve its SEE in order to point
    out the differences between the different NEXT-STATEs. Currently I get:

    /2 >order ok
    next-state xt-see
    noname :
    dup @ #1 and
    IF next-state
    ELSE next-state
    THEN ; ok

    Disadvantages are:
    - the Forth code doesn't give much idea of the overall operation of the
    FSM (probably true for FSMs in general)

    That's definitely the case here. IIRC for Michael Gassanenko it was a demonstration of filtering and backtracking, but it's unclear to me
    how that transfers to FSMs.

    Anyway, let's look at the core:

    : run-fsm ( ad xt -- ) begin dup while execute repeat 2drop ;

    This means that every state has to return to this loop to dispatch the
    next one. Now Gforth (development) has EXECUTE-EXIT, which allows
    tail-calling the next state for better efficiency.
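For readers who don't speak Forth, the loop-based dispatch (the RUN-FSM style) can be sketched in Python; the names and the toy action are mine, not from the source, and the tail-call variant has no direct analogue here because CPython never eliminates tail calls:

```python
# A rough Python analogue of the RUN-FSM dispatch loop: each state action
# returns (new_state, next_action); next_action is None when the machine
# stops, and control always comes back to this loop.

def run_fsm(state, action):
    while action is not None:
        state, action = action(state)
    return state

# A toy one-action machine that counts a number down to zero:
def dec(n):
    return n - 1, (dec if n > 1 else None)

print(run_fsm(3, dec))  # -> 0
```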

    I have worked out an example: https://www.complang.tuwien.ac.at/forth/programs/fsm-ae.4th

    Let's first look at the example: The example recognizes and prints
    numbers in a text and ignores everything else. It terminates when it
    sees '$'. It has two states, one for being inside a number and one
    for outside:

    state outside-num
    state inside-num

(Note that this is not the standard Forth word STATE).

    Then we define transitions:

    : out->out ( c-addr -- c-addr1 )
    count outside-num transition ;

    ' out->out outside-num all-transitions

    The out->out transition is the simplest one: It fetches the next char
    (with COUNT) and switches to OUTSIDE-NUM. TRANSITION already starts
    the dispatch for that state and the next char; this (and maybe also
    COUNT) could be put in the general FSM interpreter (START-DFA), but by
    having TRANSITION in the individual transition actions (e.g.,
    OUT->OUT), the implementation is more flexible, as we will see.

    At first OUT->OUT is put in transitions from OUTSIDE-NUM for all
characters using ALL-TRANSITIONS; later the transitions of various
    characters are overwritten:

    ' out->in '9' 1+ '0' outside-num some-transitions
    ' out->stop '$' outside-num one-transition

    Note that the stack effect comment for out->out is from the start of
    the word to the start of the next state-transition word; the actual
    stack effect depends on the implementation of transition.

    For more state transitions and the corresponding transition words see
    the source code.

    Example usage:

    s" 123 abc 456 df$" drop outside-num start-dfa \ prints "123 456"

    Now for the implementations: States are just arrays of xts, indexed by
    the character, and the xt is that of the transition from the state
    with that character.
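As a cross-check of this structure, here is a hypothetical Python transcription of the example; all names are my own, and the blank printed between numbers is an assumption about the output format, not something confirmed by the source:

```python
# Hypothetical Python rendering of the fsm-ae.4th scheme: a state is an
# array of 256 transition callables indexed by the input character.

def make_state():
    return [None] * 256

def all_transitions(xt, state):           # like ALL-TRANSITIONS
    state[:] = [xt] * 256

def some_transitions(xt, hi, lo, state):  # like SOME-TRANSITIONS
    for c in range(lo, hi):
        state[c] = xt

def one_transition(xt, ch, state):        # like ONE-TRANSITION
    state[ord(ch)] = xt

outside_num, inside_num = make_state(), make_state()
out = []  # collected output characters

# Transition actions: consume s[i], return (next_state, i+1), or None to stop.
def out_to_out(s, i):  return outside_num, i + 1
def out_to_in(s, i):   out.append(s[i]); return inside_num, i + 1
def in_to_in(s, i):    out.append(s[i]); return inside_num, i + 1
def in_to_out(s, i):   out.append(' ');  return outside_num, i + 1
def stop(s, i):        return None

all_transitions(out_to_out, outside_num)
some_transitions(out_to_in, ord('9') + 1, ord('0'), outside_num)
one_transition(stop, '$', outside_num)
all_transitions(in_to_out, inside_num)
some_transitions(in_to_in, ord('9') + 1, ord('0'), inside_num)
one_transition(stop, '$', inside_num)

def start_dfa(s, state):  # the loop variant of START-DFA
    r = (state, 0)
    while r is not None:
        state, i = r
        r = state[ord(s[i])](s, i)

start_dfa("123 abc 456 df$", outside_num)
print(''.join(out))  # -> 123 456 (with a trailing blank)
```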

    The implementation without EXECUTE-EXIT looks as follows:

    : transition ( c addr -- xt )
    \ addr specifies the next state
    ]] swap th @ [[ ; immediate

    : stop ( c-addr -- 0 )
    drop 0 ;

    : start-dfa ( c-addr addr -- )
    swap count rot transition
    begin ( ... xt )
    execute dup
    0= until
    drop ;

    TRANSITION could be a plain colon definition here, but it is a macro
in order to make it more competitive in Gforth with the EXECUTE-EXIT
    variant. Here the termination is performed by replacing the next
    c-addr with 0 and testing for 0 in the loop.

    An alternative implementation is to use EXECUTE-EXIT to tail-call the
    next transition:

    : transition ( c addr -- )
    ]] swap th @ execute-exit [[ ; immediate

    : stop ( -- )
    \ let the ";" behind the STOP do the stopping
    ]] drop [[ ; immediate

    : start-dfa ( c-addr addr -- )
    \ let the dfa work on the string starting at c-addr, with initial
    \ state addr
    swap count rot transition ;

    Here TRANSITION contains the EXECUTE in the form of EXECUTE-EXIT, and
    so each transition word directly calls the next one, and no loop is
    necessary; with EXECUTE this would fill the return stack after a few
    thousand transitions, but EXECUTE-EXIT takes the return address off
    the return stack before EXECUTEing the next word and therefore can
    perform an indefinite number of state transitions.
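The return-stack point can be made concrete in Python, where a "tail call" at the end of a function still pushes a frame, just like plain EXECUTE would push a return address; the recursion here stands in for one transition calling the next:

```python
# Without tail-call elimination, a few thousand chained "transitions"
# exhaust the call stack; this is what EXECUTE-EXIT avoids by dropping
# the return address before dispatching the next word.
import sys

def step(n):
    if n == 0:
        return 'done'
    return step(n - 1)  # syntactically a tail call, but CPython keeps the frame

sys.setrecursionlimit(2000)
try:
    step(5000)
    overflowed = False
except RecursionError:
    overflowed = True
print(overflowed)  # -> True
```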

    So how do we get out of the state machine? By not performing a
    transition; instead we simply return to the caller of START-DFA.

    I looked at the generated code and thought that we can get rid of the
    SWAP in the transition code by using the generalized constant folding
    feature of Gforth. This replaces the definition of TRANSITION with:

    : noopt-transition-compile, ( xt -- )
    \ basic compile, implementation for TRANSITION
    drop ]] swap th @ execute-exit [[ ;

    : 1opt-transition-compile, ( xt -- )
    drop lits> ]] cells literal + @ execute-exit [[ ;

    `noopt-transition-compile, 1 foldn: transition-compile,
    `1opt-transition-compile, 1 set-fold#

    : transition ( c addr -- )
    \ dispatches the transition code for char C in state ADDR. The
    \ stack effect describes the state of the stack when the xt has
    \ been consumed, not when the EXECUTE-EXIT returns, if at all;
    \ either compile, or execute-exit this word in order to achieve
    \ tail-call optimization
    [ 0 noopt-transition-compile, ] ;
    ' transition-compile, set-optimizer

    The NOOPT-TRANSITION-COMPILE, implementation is used when TRANSITION
    is compiled without a preceding constant, 1OPT-TRANSITION-COMPILE, is
    used when 1 or more constants precede the compilation of TRANSITION.
    In the latter case code without the SWAP is generated.

How fast are the variants? I have a benchmark that performs 1M
iterations of processing a string of 100 non-digits (the last char is
'$', of course), and one where the processed string contains numbers.
The latter takes more instructions and cycles, but the former performs
just 99 OUT->OUT transitions in each iteration and shows the basic
performance of the whole scheme.

    Here are the results on a Zen4:

loop          plain         optimized     implementation variant
1_278_763_454 1_175_241_846 1_175_505_964 cycles
3_651_376_357 2_441_376_030 2_045_832_844 instructions

    "loop" is the version that has a loop; the others are the EXECUTE-EXIT variants, "plain" uses the macro, "optimized" tries to do better by
    recognizing that there is a constant in play. While "optimized" saves
    4 instructions per transition compared to "plain", it does not save
    cycles. Overall the number of cycles is quite high given the number
    of instructions and the capabilities of the CPU. I'll take a look at
    it below. The loop variant costs 12 instructions and 1 cycle per
    transition more than the plain variant.

    Here's the code for OUT->OUT:

plain                        optimized
count 1->2                   count 1->2
mov rax,r8                   mov rax,r8
add r8,$01                   add r8,$01
movzx r15d,byte PTR [eax]    movzx r15d,byte PTR [eax]
lit 2->3                     cells 2->2
outside-num                  shl r15,$03
mov r9,$10[rbx]              lit+ 2->2
swap 3->3                    outside-num
mov rax,r15                  add r15,$18[rbx]
mov r15,r9                   @ 2->2
mov r9,rax                   mov r15,[r15]
cells 3->3                   execute-;s 2->1
shl r9,$03                   add rbx,$30
+ 3->2                       mov rbx,$00[r13]
add r15,r9                   mov rax,-$10[r15]
@ 2->2                       mov rdx,r15
mov r15,[r15]                add r13,$08
execute-;s 2->1              sub rbx,$08
add rbx,$40                  jmp eax
mov rbx,$00[r13]
mov rax,-$10[r15]
mov rdx,r15
add r13,$08
sub rbx,$08
jmp eax

    You see only 13 instructions in OPTIMIZED here. The jump first leads
to DOCOL (6 instructions) before entering this code again.

    For now I don't see why the whole thing takes so many cycles. I'll
    take a closer look later.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Gerry Jackson@21:1/5 to Anton Ertl on Mon Mar 10 11:49:26 2025
    On 09/03/2025 16:29, Anton Ertl wrote:
    Gerry Jackson <do-not-use@swldwa.uk> writes:
    Use of a wordlist, whose wid is held in an immediate constant, enables
    easy linkage between states at compile time, eg a typical state action
    in outline is:

    :noname <action>
    if state-x [ >order ] next-state [ previous ]
    else state-y [ >order ] next-state [ previous ]
    ; this-state >order to next-state previous

    It means that Gforth will have to improve its SEE in order to point
    out the differences between the different NEXT-STATEs. Currently I get:

    /2 >order ok
    next-state xt-see
    noname :
    dup @ #1 and
    IF next-state
    ELSE next-state
    THEN ; ok
:) Well if you've never encountered this in (how many years has Gforth
been going - 25/30?) it's hardly worth bothering. At least it makes
someone think something strange is going on.


    Disadvantages are:
    - the Forth code doesn't give much idea of the overall operation of the
    FSM (probably true for FSMs in general)

    That's definitely the case here.

The same applies to your example. Most FSM implementations have a
    transition table and/or a state diagram.

    IIRC for Michael Gassanenko it was a
    demonstration of filtering and backtracking,

That's correct. I said it was a simple example of an FSM to demonstrate
the principle of the FSM.

    but it's unclear to me
    how that transfers to FSMs.

The transition table for my example is:

                 Next State
State    If test true   If test false
-----    ------------   -------------
inc-n    /2             exit FSM
/2       inc-n          /3
/3       inc-n          .n
.n       inc-n          inc-n

    As posts are text only it's not suitable for a state diagram.


    Anyway, let's look at the core:

    : run-fsm ( ad xt -- ) begin dup while execute repeat 2drop ;

    This means that every state has to return to this loop to dispatch the
    next one. Now Gforth (development) has EXECUTE-EXIT, which allows tail-calling the next state for better efficiency.

You know more about the effects of an instruction cache than me, but
wouldn't a short loop like that be likely to fit in a cache line?

Your example with EXECUTE-EXIT made me think that I ought to be able to
change mine to include an R> DROP at the start of each state to get the
same effect and eliminate the loop. Non-standard of course, and slower,
but apparently more portable. Possibly replacing the NEXT-STATE value
with a common deferred word to handle forward definitions as before.


    I have worked out an example: https://www.complang.tuwien.ac.at/forth/programs/fsm-ae.4th

    Got to go but I'll return to this and look at your example in more
    detail later today hopefully.

    [...]


    - anton


    --
    Gerry

  • From Anton Ertl@21:1/5 to Anton Ertl on Tue Mar 11 09:53:42 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Here are the results on a Zen4:

    loop plain optimized implementation variant
    1_278_763_454 1_175_241_846 1_175_505_964 cycles
    3_651_376_357 2_441_376_030 2_045_832_844 instructions
    ...
    For now I don't see why the whole thing takes so many cycles. I'll
    take a closer look later.

    It seems to be an interaction between microarchitectural
    corner-cutting in Zen4 and the way that Gforth is implemented, see <2025Mar11.091817@mips.complang.tuwien.ac.at> and <2025Mar10.181427@mips.complang.tuwien.ac.at> in comp.arch.

    So let's measure on Golden Cove (Intel 1315U P-Core) which apparently
    does not have that problem, at least not for the "dup execute-exit" microbenchmark I used in those measurements. For the fsm-ae.4th
    bench1 benchmark I see the following, however:

loop          plain         optimized     implementation variant
1_193_518_309 1_284_084_677 1_282_838_710 cycles
3_755_406_062 2_445_475_893 2_049_460_804 instructions

    I.e., the plain and optimized variants are even slower than on Zen4,
    and the loop version is not just faster than on Zen4, but even faster
    than plain and optimized. To see why that is, here's the out->out code followed by the docol code:

    count 1->2
    mov rax,r8
    add r8,$01
    movzx r15d,byte PTR [eax]
    cells 2->2
    shl r15,$03
    lit+ 2->2
    outside-num
    0-6 add r15,$18[rbx]
    @ 2->2
    6-11 mov r15,[r15]
    execute-;s 2->1
    add rbx,$30
    mov rbx,[r14]
    mov rax,-$10[r15]
    11-11 mov rdx,r15
    add r14,$08
    sub rbx,$08
    jmp eax

    add rbx,$08
    sub r14,$08
    mov [r14],rbx
    11-11 mov rbx,rdx
    mov rax,[rbx]
    jmp eax

    In the benchmark the out->out code jumps to docol and docol jumps to
    out->out in 99/100 cases.

I see a dependence chain here that works through the instructions where I
    have added some annotations like "0-6" in front. 0-6 means that with
    rbx available in cycle 0, the result is available in r15 in cycle 6 (5
    cycles from the load, 1 from the addition). The next instruction in
    the chain depends on that result in r15 and produces another result in
    r15. Then we have mov instructions that move the result into rdx and
    finally into rbx where it enters the sequence again.
    Register-register mov instructions usually consume no cycles on the
    Golden Cove, so they are counted with 0 cycles here.
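The 11 cycles can be tallied explicitly; this is my own arithmetic using the latencies as stated in the post, not an independent measurement:

```python
# Bookkeeping for the annotated dependence chain:
load_use = 5   # L1 load-to-use latency in cycles
alu_add = 1    # simple integer add
reg_mov = 0    # reg-reg movs usually eliminated on Golden Cove

chain = load_use + alu_add   # "0-6":  add r15,$18[rbx] = load + add
chain += load_use            # "6-11": mov r15,[r15]
chain += 2 * reg_mov         # "11-11": mov rdx,r15 / mov rbx,rdx
print(chain)  # -> 11, vs. the measured 12 cycles per transition
```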

    My computation explains 11 cycles per iteration, but not the measured
    12; I may be unaware of another source of delay, or some other
    dependence chain may be the reason; I am pretty sure that I have the
    right one, though, especially because the results with "dup
    execute-exit" eliminate this particular dependence chain and run at
    2.4 cycles per iteration on the Golden Cove.

    I also let llvm-mca (microarchitectural analysis) have a go at it, and
    it assumes 1 cycle latency for the register-register moves, and
    reports:

    [~:156499] cat tmp/yyy.s |llvm-mca-16 -mcpu=alderlake --iterations=1000 -timeline
    Iterations: 1000
    Instructions: 19000
    Total Cycles: 13019
    Total uOps: 12000

    So it claims that this sequence should take 13 cycles per iteration.

    It also gives various other information but I find that hard to
    interpret.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/

  • From Gerry Jackson@21:1/5 to Anton Ertl on Fri Mar 7 12:23:46 2025
    On 06/03/2025 18:09, Anton Ertl wrote:
    mhx@iae.nl (mhx) writes:
    However, a state machine has well defined rules based on a
    state's stored information and its inputs, causing it to go to
    another well-defined state while generating outputs. In that
    context a goto is harmless and merely serves as a crutch when
    there are not enough computing nodes to serve all states in
    parallel. How to make such an efficient crutch in Forth?

    You lost me. Why would one "serve all states in parallel"?

    Anyway, if the question is how to implement a state machine
    efficiently in Forth, one answer is to stay at a higher, more
    structured level of abstraction or recreate it from the state machine.
E.g., don't transform a regular expression into a (nondeterministic or deterministic) finite state machine, but instead interpret it directly (that's what Bernd Paysan's regexp.fs does). Or instead of
    transforming a grammar into a push-down automaton, transform it into a structured Forth program (like Gray does).

    If you cannot do that, in standard Forth you don't really have good
    options. The best is probably to have the current state on the stack (probably in the form of the address of an array indexed with the
    input (or whatever causes a state change) and containing the potential
next states at the appropriate elements).

    In a particular implementation, you can do more, including goto-like
    things. What I would do then is have a colon definition per state,
    and do the transition to the next state as tail call. Have some
    efficient forward-tail-call mechanism to allow calling states where
    the definition comes later. Gforth has a clever FORWARD, but for now
    that does not do tail-calls.

    - anton

    About a year ago I wanted a finite state machine and thought that the
    Julian Noble approach was rather slow, so I developed a different
    approach that I haven't seen elsewhere. This method defines a state of
the FSM as a wordlist that contains an action including a test (a
:noname definition) and a VALUE that holds the execution token of the
    state's action/test.

A simple example that demonstrates the principle is given below; it
    implements the Michael Gassanenko example that selects integers that can
    be divided by 6. See https://groups.google.com/g/comp.lang.forth/c/iHCT2IaqxSA/m/85IUu0GSBwAJ
    for another implementation. An FSM for the /6 example is not the best
    solution for the example but is simply used to demonstrate the principle.

    Use of a wordlist, whose wid is held in an immediate constant, enables
    easy linkage between states at compile time, eg a typical state action
    in outline is:

    :noname <action>
    if state-x [ >order ] next-state [ previous ]
    else state-y [ >order ] next-state [ previous ]
    ; this-state >order to next-state previous

    where
- next-state is the VALUE holding the state's action xt. A common name is
    used for each state so that code can be factored out.
    - state-x and state-y are the immediate constants holding the state's
    wordlist.
    - the wordlist switching is factored out for readability

    Advantages are:
    - the FSM run time loop is simple
    - wordlist switching only occurs at compile time
    - forward reference is easy because state wordlists and a common name
    for the action-xt VALUEs are defined at the start.
- easily extendable for states that can go to several other states,
    e.g. use a CASE statement for the state action.
    - I believe but haven't proved that the run time code could be
    automatically generated by defining a simple state transition table

    Disadvantages are:
    - the Forth code doesn't give much idea of the overall operation of the
    FSM (probably true for FSMs in general)
    - the state wordlists are redundant at run time

    \ The example
    : fsm-state ( "state-name" "action-name" -- )
    get-current
    wordlist dup constant immediate set-current 0 value
    set-current
    ;

    \ state name state action-xt
    \ ~~~~~~~~~~ ~~~~~~~~~~~~~~~
    fsm-state inc-n next-state
    fsm-state /2 next-state
    fsm-state /3 next-state
    fsm-state .n next-state

    : is-next-state ( wid -- )
    >order s" next-state" evaluate previous
    ; immediate

    0 constant stop-fsm

    :noname ( ad -- ad xt|0 )
    dup 1 over +! 2@ >=
    if /2 is-next-state else stop-fsm then
    ; inc-n >order to next-state previous

    :noname ( ad -- ad xt )
    dup @ 2 mod if inc-n is-next-state else /3 is-next-state then
\ No, the two IS-NEXT-STATEs can't be replaced by one after the THEN
    ; /2 >order to next-state previous

    :noname ( ad -- ad xt )
    dup @ 3 mod if inc-n is-next-state else .n is-next-state then
    ; /3 >order to next-state previous

    :noname ( ad -- ad xt )
    dup @ . ( drop ) inc-n is-next-state
    ; .n >order to next-state previous

    : run-fsm ( ad xt -- ) begin dup while execute repeat 2drop ;

    2variable counter

    inc-n >order next-state constant start previous

    : /6? ( n -- )
    cr cr 0 counter 2!
    counter start run-fsm
    ;

    78 /6? \ displays 6 12 18 24 30 36 42 48 54 60 66 72 78
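For comparison, here is a hedged Python rendering of the same /6 machine driven directly by the transition table quoted earlier in the thread; the driver and all names are mine, and it also hints at how run-time code could be generated from such a table:

```python
# Table-driven sketch of the /6 FSM: each state action returns the name
# of the next state, or None to leave the machine.

def select_div6(limit):
    n = 0
    results = []

    def inc_n():
        nonlocal n
        n += 1
        return '/2' if n <= limit else None   # test false -> exit FSM

    def div2():
        return 'inc-n' if n % 2 else '/3'     # odd -> back to inc-n

    def div3():
        return 'inc-n' if n % 3 else '.n'     # not /3 -> back to inc-n

    def dot_n():
        results.append(n)                     # "print" the number
        return 'inc-n'

    states = {'inc-n': inc_n, '/2': div2, '/3': div3, '.n': dot_n}
    state = 'inc-n'
    while state is not None:
        state = states[state]()
    return results

print(select_div6(78))  # -> [6, 12, 18, 24, 30, 36, 42, 48, 54, 60, 66, 72, 78]
```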


    --
    Gerry
