• Re: branch splitting

    From Paul Clayton@paaronclayton@gmail.com to comp.arch on Sun Feb 8 10:24:54 2026
    From Newsgroup: comp.arch

    On 11/5/25 3:43 PM, MitchAlsup wrote:
    [snip]
    I am now working on predictors for a 6-wide My 66000 machine--which is a bit different.
    a) VEC-LOOP loops do not alter the branch prediction tables.
    b) Predication clauses do not alter the BPTs.

    Not recording the history of predicates may have a negative
    effect on global history predictors. (I do not know if anyone
    has studied this, but it has been mentioned -- e.g.,
    "[predication] has a negative side-effect because the removal
    of branches eliminates useful correlation information
    necessary for conventional branch predictors" from "Improving
    Branch Prediction and Predicated Execution in Out-of-Order
    Processors", Eduardo Quiñones et al., 2007.)

    Predicate prediction can also be useful when the availability
    of the predicate is delayed. Similarly, selective eager
    execution might be worthwhile when the predicate is delayed;
    the selection is likely to be predictive (resource use might
    be a basis for selection but even estimating that might be
    predictive).

    With predication of all short forward branches in order to
    avoid fetch bubbles, the impact of delayed predicate
    availability and missing information for branch prediction
    may be greater than for more selective predication.

    There may also be some short forward branches that are 99%
    taken such that converting to a longer branch with a jump back
    may be a better option. With trace-cache-like optimization,
    such branched over code could be removed from fetch even when
    the compiler used a short branch. Dynamic code organization
    has the advantage of being able to use dynamically available
    information (and the disadvantage of gathering the information
    and making a decision dynamically).

    Something like a branch target cache could store extracted
    instructions. This might facilitate stitching such
    instructions back into the instruction stream with limited
    overhead. Since this would only work for usually taken
    hammock branches, it would probably not be worthwhile. For
    if-then-else constructs, one might place both paths in
    separate entries in such a target cache and always stitch
    in one of them, but that seems wonky.

    I rather doubt the benefits of such would justify the added
    complexity -- almost certainly not in a first or second
    implementation of an architecture -- but I would not want to
    reject future possible adoption of such techniques.

    My guess would be that most short forward branches would not
    use an extracted code cache either because they are generally
    not taken so there is no fetch advantage or because the branch
    direction is unpredictable such that predication likely makes
    more sense and fetching from two structures just adds
    complexity.

    For highly unlikely code, an extracted cache might have
    higher latency (from placement, from deferred access to use
    better prediction, or from more complex retrieval). Stalling
    renaming when a longer-latency insertion is predicted seems
    undesirable (though it may have negligible performance harm),
    but including just enough dataflow information in the quicker
    accessed caches to support out-of-order fetch seems
    complicated.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Feb 9 19:20:01 2026
    From Newsgroup: comp.arch


    Paul Clayton <paaronclayton@gmail.com> posted:

    On 11/5/25 3:43 PM, MitchAlsup wrote:
    [snip]
    I am now working on predictors for a 6-wide My 66000 machine--which is a bit
    different.
    a) VEC-LOOP loops do not alter the branch prediction tables.
    b) Predication clauses do not alter the BPTs.

    Not recording the history of predicates may have a negative
    effect on global history predictors. (I do not know if anyone
    has studied this, but it has been mentioned -- e.g.,
    "[predication] has a negative side-effect because the removal
    of branches eliminates useful correlation information
    necessary for conventional branch predictors" from "Improving
    Branch Prediction and Predicated Execution in Out-of-Order
    Processors", Eduardo Quiñones et al., 2007.)

    It depends on where you are looking! If you think branch prediction
    alters where FETCH is fetching, then My 66000 predication does not
    do predication prediction--predication is used when the join point
    will have already been fetched by the time the condition is known.
    Then, either the then clause or the else clause will be nullified
    without backup (i.e., branch prediction repair).

    DECODE is still able to predict then-clause versus else-clause
    and maintain the no-backup property, as long as both sides are
    issued into the execution window.

    Predicate prediction can also be useful when the availability
    of the predicate is delayed. Similarly, selective eager
    execution might be worthwhile when the predicate is delayed;
    the selection is likely to be predictive (resource use might
    be a basis for selection but even estimating that might be
    predictive).

    The difference is that predication prediction never needs branch
    prediction repair.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Feb 18 14:45:36 2026
    From Newsgroup: comp.arch

    On 2/16/2026 3:14 PM, Paul Clayton wrote:
    On 11/5/25 2:00 AM, BGB wrote:
    On 11/4/2025 3:44 PM, Terje Mathisen wrote:
    MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    [snip]
    Branch prediction is fun.

    When I looked around online before, a lot of stuff about branch
    prediction was talking about fairly large and convoluted schemes for
    the branch predictors.

    You might be interested in looking at the 6th Championship
    Branch Prediction (2025): https://ieeetcca.org/2025/02/18/6th-championship-branch-prediction-cbp2025/


    Quick look, didn't see much information about who entered or won...


    TAgged GEometric length predictors (TAGE) seem to be the current
    "hotness" for branch predictors. These record very long global
    histories and fold them into shorter indexes with the number of
    history bits used varying for different tables.

    (Because the correlation is less strong, 3-bit counters are
    generally used, along with a "useful" bit.)


    When I messed with it, increasing the strength of the saturating
    counters was not effective.

    But, increasing the ability of them to predict more complex patterns did
    help.



    But, then always at the end of it using 2-bit saturating counters:
      weakly taken, weakly not-taken, strongly taken, strongly not taken.

    But, in my fiddling, there was seemingly a simple but moderately
    effective strategy:
      Keep a local history of taken/not-taken;
      XOR this with the low-order-bits of PC for the table index;
      Use a 5/6-bit finite-state-machine or similar.
        Can model repeating patterns up to ~ 4 bits.
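    Sketched in Python, with plain 2-bit counters standing in for the
    5/6-bit FSM entries (whose tables are not written out here); the
    class name, table size, and PC shift are made up for illustration:

```python
TABLE_BITS = 6  # 64 entries, matching a LUT6-sized lookup

class XorPredictor:
    """History-XOR-PC indexed table; entries are 2-bit counters (0..3)."""

    def __init__(self):
        self.table = [1] * (1 << TABLE_BITS)  # start weakly not-taken
        self.hist = 0                         # recent taken/not-taken bits

    def _index(self, pc):
        mask = (1 << TABLE_BITS) - 1
        return ((pc >> 2) ^ self.hist) & mask

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2  # counter MSB = taken

    def update(self, pc, taken):
        i = self._index(pc)
        c = self.table[i]
        self.table[i] = min(3, c + 1) if taken else max(0, c - 1)
        self.hist = ((self.hist << 1) | (1 if taken else 0)) & ((1 << TABLE_BITS) - 1)
```

    Because the history participates in the index, a strictly
    alternating branch trains two separate counters (one per history
    value) and becomes perfectly predictable, which a counter alone
    cannot do.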

    Indexing a predictor by _local_ (i.e., per instruction address)
    history adds a level of indirection; once one has the branch
    (fetch) address one needs to index the local history and then
    use that to index the predictor. The Alpha 21264 had a modest-
    sized (by modern standards) local history predictor with a 1024-
    entry table of ten history bits indexed by the Program Counter
    and the ten bits were used to index a table of three-bit
    counters. This was combined with a 12-bit global history
    predictor with 2-bit counters (note: not gshare, i.e., the
    history was not XORed with the instruction address) and the
    same index was used for the chooser.
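    The two-level local lookup described above can be sketched as
    follows (table sizes per the 21264 description; the class name and
    PC hashing are arbitrary):

```python
class LocalPredictor:
    """21264-style local predictor: PC indexes a 1024-entry table of
    10-bit local histories; the history then indexes a 1024-entry
    table of 3-bit saturating counters (0..7)."""

    def __init__(self):
        self.histories = [0] * 1024  # 10 bits of per-address history
        self.counters = [3] * 1024   # 3-bit counters, >= 4 means taken

    def predict(self, pc):
        h = self.histories[(pc >> 2) & 1023]
        return self.counters[h] >= 4

    def update(self, pc, taken):
        hi = (pc >> 2) & 1023
        h = self.histories[hi]
        c = self.counters[h]
        self.counters[h] = min(7, c + 1) if taken else max(0, c - 1)
        self.histories[hi] = ((h << 1) | (1 if taken else 0)) & 1023
```

    Note the serialized indirection: the counter table cannot be read
    until the history table read completes, which is the latency cost
    mentioned above.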


    OK, seems I wrote it wrong:
    I was using a global branch history.

    But, either way, the history is XOR'ed with the relevant bits from PC to generate the index.


    I do not know if 5/6-bit state machines have been academically
    examined for predictor entries. I suspect the extra storage is a
    significant discouragement given one often wants to cover more
    different correlations and branches.


    If the 5/6-bit FSM can fit more patterns than 3x 2-bit saturating
    counters, it can be a win.


    As noted, the 5/6 bit FSM can predict arbitrary 4 bit patterns.

    With the PC XOR Hist lookup, there were still quite a few patterns, and
    not just contiguous 0s or 1s that the saturating counter would predict,
    nor just all 0s, all 1s, or ...101010... that the 3-bit FSM could deal with.


    But, a bigger history could mean fewer patterns and more contiguous bits
    in the state.



    TAGE has the advantage that the tags reduce branch aliases and
    the variable history length (with history folding/compression)
    allows using less storage (when a prediction only benefits from
    a shorter history) and reduces training time.
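    The history folding can be sketched as a from-scratch XOR fold
    (real TAGE maintains the fold incrementally with a circular shift
    so it is O(1) per branch; the function names here are made up):

```python
def fold_history(hist_bits, hist_len, index_bits):
    """XOR-fold the low `hist_len` bits of a long global history
    down to `index_bits` bits, chunk by chunk."""
    folded = 0
    h = hist_bits & ((1 << hist_len) - 1)
    while h:
        folded ^= h & ((1 << index_bits) - 1)
        h >>= index_bits
    return folded

def tage_index(pc, hist, hist_len, index_bits=10):
    """Index for one TAGE table: folded history hashed with PC bits.
    Each table uses a different (geometrically growing) hist_len."""
    mask = (1 << index_bits) - 1
    return (fold_history(hist, hist_len, index_bits) ^ (pc >> 2)) & mask
```

    Different tables call this with different `hist_len` values, which
    is what lets short-history branches avoid paying for long-history
    storage.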


    In my case, was using 6-bit lookup mostly to fit into LUT6 based LUTRAM.

    Going bigger than 6 bits here is a pain point for FPGAs, more so as
    BRAM's don't support narrow lookups, so the next size up would likely be
    2048x but then inefficiently using the BRAM (there isn't likely a
    particularly good way to make use of 512x 18-bits).


    Though, more efficient case here would likely be to use LUT5's (5-bit
    index), or maybe MUXing LUT5's (for 128x 6b, maybe 512x 6b with 3-levels
    of LUTs).


    I guess, it is an open question if one did:
    reg[4:0] predArray[511:0];
    If the synthesis would figure out the most optimal pattern on its own,
    vs needing to get more manual, with an array more like:
    reg[95:0] predArray[31:0]; //fit to 5b index / 3b data.

    With it padded up to 6-bits per entry to allow fitting it to the LUTRAM
    1R1W pattern (densely packing to 80 bits would not fit the pattern).


    But, yeah, if the size of the lookup were extended, it could make sense
    to start hashing part of the history, say, maybe:

    hhist[ 5: 0] = hist[5:0];
    hhist[ 9: 6] = hist[9:6] ^ hist[13:10];
    hhist[11:10] = hist[11:10] ^ hist[13:12] ^ hist[15:14];

    Then, say:
    index[11:0] = hhist[11:0] ^ { pc[12:2], pc[13] ^ pc[1] };

    ...
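    Transcribed into Python so the widths can be sanity-checked (slice
    bounds follow the Verilog-style [hi:lo] ranges above; `bits` and
    `hashed_index` are made-up helper names):

```python
def bits(x, hi, lo):
    """Extract the Verilog-style bit slice x[hi:lo]."""
    return (x >> lo) & ((1 << (hi - lo + 1)) - 1)

def hashed_index(pc, hist):
    """12-bit table index from a partially-folded history XOR PC bits,
    per the sketch above."""
    hhist = bits(hist, 5, 0)
    hhist |= (bits(hist, 9, 6) ^ bits(hist, 13, 10)) << 6
    hhist |= (bits(hist, 11, 10) ^ bits(hist, 13, 12) ^ bits(hist, 15, 14)) << 10
    # { pc[12:2], pc[13] ^ pc[1] }: pc[12:2] in the upper 11 bits,
    # the XORed bit in bit 0
    pcmix = (bits(pc, 12, 2) << 1) | (bits(pc, 13, 13) ^ bits(pc, 1, 1))
    return (hhist ^ pcmix) & 0xFFF
```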


    (I am a little surprised that I have not read a suggestion to
    use the alternate 2-bit encoding that retains the last branch
    direction. This history might be useful for global history; the
    next most recent direction (i.e., not the predicted direction)
    of previous recent branches for a given global history might be
    useful in indexing a global history predictor. This 2-bit
    encoding seems to give slightly worse predictions than a
    saturating counter but the benefit of "localized" global history
    might compensate for this.)

    Alloyed prediction ("Alloyed Branch History: Combining Global
    and Local Branch History for Robust Performance", Zhijian Lu et
    al., 2002) used a tiny amount of local history to index a
    (mostly) global history predictor, hiding (much of) the latency
    of looking up the local history by retrieving multiple entries
    from the table and selecting the appropriate one with the local
    history.

    There was also a proposal ("Branch Transition Rate: A New Metric
    for Improved Branch Classification Analysis", Michael Haungs et
    al., 2000) to consider transition rate, noting that high
    transition rate branches (which flip direction frequently) are
    poorly predicted by averaging behavior. (Obviously, loop-like
    branches have a high transition rate for one direction.) This is
    a limited type of local history. If I understand correctly, your
    state machine mechanism would capture the behavior of such
    highly alternating branches.

    Compressing the history into a pattern does mean losing
    information (as does a counter), but I had thought such pattern
    storage might be more efficient than storing local history. It
    is also interesting that the Alpha 21264 local predictor used
    dynamic pattern matching rather than a static transformation of
    history to prediction (state machine).

    I think longer local history prediction has become unpopular,
    probably because nothing like TAGE was proposed to support
    longer histories but also because the number of branches that
    can be tracked with long histories is smaller.

    Local history patterns may also be less common than statistical
    correlation after one has extracted branches predicted well by
    global history. (For small-bodied loops, a moderately long
    global history provides substantial local history.)


    ...

    It seems what I wrote originally was inaccurate: I don't store a history per-target; it was merely recent taken/not-taken branches.

    But, I no longer remember what I was thinking at the time, or why I had written local history rather than global history (unless I meant "local"
    in terms of recency or something, I don't know).



    Using a pattern/state machine may also make confidence
    estimation less accurate. TAGE can use confidence of multiple
    matches to form a prediction.

    The use of confidence for making a prediction also makes it
    impractical to store just a prediction nearby (to reduce
    latency) and have the extra state more physically distant. For
    per-address predictors, one could in theory use Icache
    replacement to constrain predictor size, where an Icache miss
    loads local predictions from an L2 local predictor. I think AMD
    had limited local predictors associated with the Icache and had
    previously stored some prediction information in L2 cache using
    the fact that code is not modified (so parity could be used and
    not ECC).

    Daniel A. Jiménez did some research on using neural methods
    (e.g., perceptrons) for branch prediction. The principle was
    that traditional global history tables had exponential scaling
    with history size (2 to the N table entries for N history bits)
    while per-address perceptrons would scale linearly (for a fixed
    number of branch addresses). TAGE (with its folding of long
    histories and variable history) seems to have removed this as a
    distinct benefit. General correlation with specific branches may
    also be less predictive than correlation with path.
    Nevertheless, the research was interesting and larger histories
    did provide more accurate predictions.
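    A minimal perceptron-predictor sketch along those lines (table
    size, history length, and the class name are arbitrary; the
    training threshold formula is the one commonly quoted from
    Jiménez's work):

```python
HLEN = 16                       # global history length
THETA = int(1.93 * HLEN + 14)   # training threshold

class PerceptronPredictor:
    """Per-address weight vectors over global history bits; storage
    grows linearly with HLEN rather than as 2**HLEN."""

    def __init__(self, n_entries=64):
        self.weights = [[0] * (HLEN + 1) for _ in range(n_entries)]  # +1 bias
        self.hist = [1] * HLEN  # history as +/-1 values
        self.n = n_entries

    def _output(self, pc):
        w = self.weights[(pc >> 2) % self.n]
        return w[0] + sum(wi * hi for wi, hi in zip(w[1:], self.hist))

    def predict(self, pc):
        return self._output(pc) >= 0

    def update(self, pc, taken):
        y = self._output(pc)
        t = 1 if taken else -1
        w = self.weights[(pc >> 2) % self.n]
        # Train on a misprediction or while the output is not yet confident.
        if (y >= 0) != taken or abs(y) <= THETA:
            w[0] += t
            for i in range(HLEN):
                w[i + 1] += t * self.hist[i]
        self.hist = [t] + self.hist[:-1]
```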

    Anyway, I agree that thinking about these things can be fun.


    A lot of this seems a lot more complex though than what would be all
    that practical on a Spartan or Artix class FPGA.


    I was mostly using 5/6 bit state machines as they gave better results
    than 2-bit saturating counters, and fit nicely within the constraints of
    a "history XOR PC" lookup pattern.


    Also initially, a while ago, it seemed that the patterns for these
    state machines were outside the reach of LLM AIs, but it seems the
    AIs are quickly catching up at these sorts of mental puzzles.

    Granted, I am mostly within the limits of cheap/free.


    But, OTOH, paying a steep per-month fee so that "vibe coding" could take
    my hobby from me isn't particularly compelling personally. But, even as
    such, it does seem like if people can do so, my "existential merit" is
    weakening.

    Even if as-is, apparently the people who try to do non-trivial projects
    via "vibe coding" generally end up with some big mess of poorly written
    code that falls on its face once the code gets big enough that the LLMs
    can no longer reason about it.







    Where, the idea was that the state-machine is updated with the current
    state and branch direction, giving the next state and next predicted
    branch direction (for this state).


    Could model slightly more complex patterns than the 2-bit saturating
    counters, but it is sort of a partial mystery why (for mainstream
    processors) more complex lookup schemes and 2-bit state were
    preferable to a simpler lookup scheme and 5-bit state.

    Well, apart from the relative "dark arts" needed to cram 4-bit
    patterns into a 5-bit FSM (it is a bit easier if limiting the
    patterns to 3 bits).



    Then again, had before noted that the LLMs are seemingly also not
    really able to figure out how to make a 5 bit FSM to model a full set
    of 4 bit patterns.


    Then again, I wouldn't expect it to be all that difficult of a problem
    for someone that is "actually smart"; so presumably chip designers
    could have done similar.

    Well, unless maybe the argument is that 5 or 6 bits of storage would
    cost more than 2 bits, but then presumably needing to have
    significantly larger tables (to compensate for the relative predictive
    weakness of 2-bit state) would have cost more than the cost of
    smaller tables of 6 bit state ?...

    Say, for example, 2b:
      00_0 => 10_0   //Weakly not-taken, dir=0, goes strongly not-taken
      00_1 => 01_1   //Weakly not-taken, dir=1, goes weakly taken
      01_0 => 00_0   //Weakly taken, dir=0, goes weakly not-taken
      01_1 => 11_1   //Weakly taken, dir=1, goes strongly taken
      10_0 => 10_0   //Strongly not-taken, dir=0 (stays)
      10_1 => 00_0   //Strongly not-taken, dir=1 (goes weak)
      11_0 => 01_1   //Strongly taken, dir=0 (goes weak)
      11_1 => 11_1   //Strongly taken, dir=1 (stays)
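    That table is the standard 2-bit saturating counter; written out
    as an explicit transition map (with the suffix bit read as the
    next prediction):

```python
# key = (state, observed direction) -> (next state, next prediction)
# States: 00 weakly not-taken, 01 weakly taken,
#         10 strongly not-taken, 11 strongly taken.
NEXT = {
    (0b00, 0): (0b10, 0),  # weak NT, saw 0 -> strong NT
    (0b00, 1): (0b01, 1),  # weak NT, saw 1 -> weak T
    (0b01, 0): (0b00, 0),  # weak T, saw 0 -> weak NT
    (0b01, 1): (0b11, 1),  # weak T, saw 1 -> strong T
    (0b10, 0): (0b10, 0),  # strong NT stays
    (0b10, 1): (0b00, 0),  # strong NT, saw 1 -> weak NT
    (0b11, 0): (0b01, 1),  # strong T, saw 0 -> weak T
    (0b11, 1): (0b11, 1),  # strong T stays
}
```

    Note that the strong states tolerate one wrong outcome without
    flipping the prediction, which is the hysteresis that handles loop
    exits well but alternating patterns badly.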

    Can expand it to 3-bits, for 2-bit patterns:
      As above, plus 4 more alternating states
      And slightly different transition logic.
    Say (abbreviated):
      000   weak, not-taken
      001   weak, taken
      010   strong, not-taken
      011   strong, taken
      100   weak, alternating, not-taken
      101   weak, alternating, taken
      110   strong, alternating, not-taken
      111   strong, alternating, taken
    The alternating states just flip-flop between taken and not-taken.
      The weak states can move between any of the 4.
      The strong states are used if the pattern is reinforced.
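    The transition logic is not spelled out above, so this sketch
    fills it in one plausible way: bias states predict the last
    direction, alternating states predict its opposite; a correct
    prediction strengthens, a miss from a strong state weakens, and a
    miss from a weak state switches mode:

```python
def step(state, taken):
    """One plausible 8-state update. state = (alt, strong, last);
    returns (next_state, prediction made before seeing `taken`)."""
    alt, strong, last = state
    pred = (not last) if alt else last
    if taken == pred:
        return (alt, True, taken), pred      # reinforce, record direction
    if strong:
        return (alt, False, taken), pred     # strong miss -> weak, same mode
    return (not alt, False, taken), pred     # weak miss -> switch mode
```

    On a strictly alternating stream this locks into the alternating
    states; on a biased stream it behaves like the 2-bit counter.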

    Going up to 3 bit patterns is more of the same (add another bit,
    doubling the number of states). Seemingly something goes nasty when
    getting to 4 bit patterns though (and can't fit both weak and strong
    states for longer patterns, so the 4b patterns effectively only exist
    as weak states which partly overlap with the weak states for the 3-bit
    patterns).

    But, yeah, not going to type out state tables for these ones.


    Not proven, but I suspect that an arbitrary 5 bit pattern within a 6
    bit state might be impossible. Although there would be sufficient
    state-space for the looping 5-bit patterns, there may not be
    sufficient state-space to distinguish whether to move from a
    mismatched 4-bit pattern to a 3 or 5 bit pattern. Whereas, at least
    with 4-bit, any mismatch of the 4-bit pattern can always decay to a 3-
    bit pattern, etc. One needs to be able to express decay both to
    shorter patterns and to longer patterns, and I suspect at this point,
    the pattern breaks down (but can't easily confirm; it is either this
    or the pattern extends indefinitely, I don't know...).


    Could almost have this sort of thing as a "brain teaser" puzzle or
    something...

    Then again, maybe other people would not find any particular
    difficulty in these sorts of tasks.


    Terje




    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Paul Clayton@paaronclayton@gmail.com to comp.arch on Mon Feb 23 17:17:59 2026
    From Newsgroup: comp.arch

    On 2/18/26 3:45 PM, BGB wrote:
    On 2/16/2026 3:14 PM, Paul Clayton wrote:
    On 11/5/25 2:00 AM, BGB wrote:
    On 11/4/2025 3:44 PM, Terje Mathisen wrote:
    MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    [snip]
    Branch prediction is fun.

    When I looked around online before, a lot of stuff about
    branch prediction was talking about fairly large and
    convoluted schemes for the branch predictors.

    You might be interested in looking at the 6th Championship
    Branch Prediction (2025): https://ieeetcca.org/2025/02/18/6th-
    championship-branch-prediction-cbp2025/


    Quick look, didn't see much information about who entered or won...

    This presentation (PDF) presents the winners: https://ericrotenberg.wordpress.ncsu.edu/files/2025/06/CBP2025-Closing-Remarks.pdf

    1st Place: Toru Koizumi et al.'s "RUNLTS: Register-value-aware
    predictor Utilizing Nested Large TableS"
    PDF of paper: https://ericrotenberg.wordpress.ncsu.edu/files/2025/06/cbp2025-final44-Koizumi.pdf


    2nd Place: André Seznec's "TAGE-SC for CBP2025"
    PDF of paper: https://ericrotenberg.wordpress.ncsu.edu/files/2025/06/cbp2025-final37-Seznec.pdf

    3rd Place: Yang Man et al.'s "LVCP: A Load-Value Correlated
    Predictor for TAGE-SC-L"
    PDF of paper: https://ericrotenberg.wordpress.ncsu.edu/files/2025/06/cbp2025-final15-Man.pdf

    The program lists the competitors and provides links to papers,
    presentations, video, and code: https://ericrotenberg.wordpress.ncsu.edu/cbp2025-workshop-program/

    I probably should have provided the link to the program to save
    any readers time and effort. (I had not remembered how many link
    traversals were required and just picked a url from my Firefox
    history.)

    (One nice aspect of this workshop/competition is that one can
    reproduce the results and try one's own designs. As long as one
    can trust that the traces are representative of the workloads
    one cares about, this seems significant.)

    TAgged GEometric length predictors (TAGE) seem to be the current
    "hotness" for branch predictors. These record very long global
    histories and fold them into shorter indexes with the number of
    history bits used varying for different tables.

    (Because the correlation is less strong, 3-bit counters are
    generally used, along with a "useful" bit.)


    When I messed with it, increasing the strength of the saturating
    counters was not effective.

    But, increasing the ability of them to predict more complex
    patterns did help.

    Branch prediction is heuristic and the tradeoffs change based on
    the workload and the storage (and latency) budget.

    Saturating counters record localized bias not a pattern, so
    fuzzy behavior might be more accurately predicted on average
    while (as you noted) repeated patterns are less likely to be
    predicted accurately.

    Larger counters can even out some irregularity in the bias but
    increase training time. (With tags, a new entry can be
    initialized to a decent state, which can reduce training time.)

    For per-address prediction, there are ways of reducing the
    overhead for tagging each entry. If a branch target buffer (BTB)
    is used, one is already storing a target address (or at least an
    offset/cache index) so adding some tag bits is not quite as
    expensive and allows associativity; placing a per-address
    prediction in a BTB was a common design. Per-address predictions
    can also be associated with the Icache, in which case they can
    reuse the Icache tagging. Tagging reduces aliases, but obviously
    bits spent on tags cannot be spent on prediction entries
    themselves (fewer entries means evictions that reduce training
    time).

    Aliasing is a major issue even with per-address prediction; with
    global branch history aliasing is even more likely. Several
    techniques have been proposed (besides tagging) to reduce
    aliasing such as multiple diversely indexed predictors with a
    majority vote. (The never manufactured Alpha 21464 design used
    three differently indexed tables to produce a majority vote and
    a fourth table to choose whether to use the majority vote or the
    prediction from a specific table. See "Design Tradeoffs for the
    Alpha EV8 Conditional Branch Predictor" for details.)

    (Mitch Alsup would, of course, mention his agree predictor which
    uses a second source such as a per-address prediction or a
    static prediction to exploit that predictions are likely to
    agree with their local prediction and two branches that use the
    same index could have different direction biases. In addition,
    if the global predictor is used as a complement to a per-address
    predictor, different branches with the same global index may be
    likely to disagree with a per-address predictor.)

    Capacity issues can be so significant that for some workloads
    single bit predictors outperform two-bit counters. Predicting
    more static (per-address) branches somewhat accurately can be
    more important than predicting fewer branches more accurately.

    Counters also have the minor advantage of providing a confidence
    estimate that can be used for things like prediction selection,
    checkpoint selection (doing a renaming snapshot to speed
    misprediction recovery), dynamic predication, prefetch
    throttling, or execution throttling (to save power/energy).

    But, then always at the end of it using 2-bit saturating
    counters:
      weakly taken, weakly not-taken, strongly taken, strongly
      not taken.

    But, in my fiddling, there was seemingly a simple but
    moderately effective strategy:
      Keep a local history of taken/not-taken;
      XOR this with the low-order-bits of PC for the table index;
      Use a 5/6-bit finite-state-machine or similar.
        Can model repeating patterns up to ~ 4 bits.

    Indexing a predictor by _local_ (i.e., per instruction address)
    history adds a level of indirection; once one has the branch
    (fetch) address one needs to index the local history and then
    use that to index the predictor.
    [snip]

    OK, seems I wrote it wrong:
    I was using a global branch history.

    The terminology can be tricky. The worst is probably the
    difference between bimode and bimodal; the former is similar
    to agree prediction in attempting to separate bias, the latter
    is a name for per-address prediction (two modes, taken or not
    taken).

    But, either way, the history is XOR'ed with the relevant bits
    from PC to generate the index.

    This is called a gshare predictor as distinguished from a
    gselect predictor, which concatenates bits of the global
    history and the branch address rather than XORing them.
    (There is also a distinction of path history -- address bits
    rather than taken/not-taken -- and global branch history.) As you
    probably discovered in your experimentation, hashing in the
    address bits provides significantly better prediction.
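    In index-formation terms (a minimal sketch; the bit widths are
    arbitrary, and the function names are made up):

```python
def gshare_index(pc, ghist, bits=10):
    """gshare: XOR global history with branch-address bits."""
    mask = (1 << bits) - 1
    return ((pc >> 2) ^ ghist) & mask

def gselect_index(pc, ghist, pc_bits=5, hist_bits=5):
    """gselect: concatenate (select) address bits and history bits."""
    pc_part = (pc >> 2) & ((1 << pc_bits) - 1)
    return (pc_part << hist_bits) | (ghist & ((1 << hist_bits) - 1))
```

    For the same total index width, gshare lets every history bit and
    every address bit influence the whole index, which is why it
    usually aliases less than gselect's fixed split.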

    I do not know if 5/6-bit state machines have been academically
    examined for predictor entries. I suspect the extra storage is a
    significant discouragement given one often wants to cover more
    different correlations and branches.


    If the 5/6-bit FSM can fit more patterns than 3x 2-bit
    saturating counters, it can be a win.

    I suspect it very much depends on whether bias or pattern is
    dominant. This would depend on the workload (Doom?) and the
    table size (and history length). I do not know that anyone in
    academia has explored this, so I think you should be proud of
    your discovery even if it has limited application.

    A larger table (longer history) can mean longer training, but
    such also discovers more patterns and longer patterns (e.g.,
    predicting a fixed loop count). However, correlation strength
    tends to decrease with increasing distance (having multiple
    history lengths and hashings helps to find the right history).

    As noted, the 5/6 bit FSM can predict arbitrary 4 bit patterns.

    When the pattern is exactly repeated this is great, but if the
    correlation with global history is fuzzy (but biased) a counter
    might be better.

    I get the impression that branch prediction is complicated
    enough that even experts only have a gist of what is actually
    happening, i.e., there is a lot of craft and experimentation and
    less logical derivation (I sense).

    With the PC XOR Hist lookup, there were still quite a few
    patterns, and not just contiguous 0s or 1s that the saturating
    counter would predict, nor just all 0s, all 1s, or ...101010...
    that the 3-bit FSM could deal with.

    But, a bigger history could mean fewer patterns and more
    contiguous bits in the state.

    TAGE has the advantage that the tags reduce branch aliases and
    the variable history length (with history folding/compression)
    allows using less storage (when a prediction only benefits from
    a shorter history) and reduces training time.

    In my case, was using 6-bit lookup mostly to fit into LUT6 based
    LUTRAM.

    Going bigger than 6 bits here is a pain point for FPGAs, more so
    as BRAM's don't support narrow lookups, so the next size up
    would likely be 2048x but then inefficiently using the BRAM
    (there isn't likely a particularly good way to make use of 512x
    18-bits).

    Presumably selecting specific bits of the 18 (shifting/alignment
    network) is expensive in an FPGA? With more bits per index, it
    might also make sense to use associativity with partial tags. It
    might be practical with two-way associativity to use different
    numbers of tag bits (with different address hashings) in
    different ways and different state sizes.

    [snip]
    Local history patterns may also be less common than statistical
    correlation after one has extracted branches predicted well by
    global history. (For small-bodied loops, a moderately long
    global history provides substantial local history.)

    It seems what I wrote originally was inaccurate, I don't store a
    history per-target, merely it was recent taken/not-taken branches.

    Well, I mixed up way and set in one of my comp.arch postings,
    and this was after (if I recall correctly) having read Computer
    Architecture: A Quantitative Approach as well as more than a
    couple of papers and web articles and comp.arch postings.

    But, I no longer remember what I was thinking at the time, or
    why I had written local history rather than global history
    (unless I meant "local" in terms of recency or something, I
    don't know).

    "Local" is not really a good term for per-address because of the
    potential confusion with temporal locality and the generality (a
    local branch prediction is not about predicting branches with
    nearby addresses -- I am skeptical that there is much "spatial
    locality" for branch direction).

    (Terminology in computer architecture is not very well
    organized, consistent, or insightful. I feel "Instruction
    Pointer" is a better term than "Program Counter" and
    "Translation Cache" is a better term than "Translation Lookaside
    Buffer"; the former is still strong because of x86, but the
    latter had Itanium as a "backer". In terms of historical
    precedence, "elbow cache" should be preferred for a cache that
    uses different indexing methods to find an alternative index for
    an evicted block rather than "cuckoo cache", which seems more
    evocative and was used for hash tables. Then there is "block"
    (IBM) versus "line" (Intel) and whether a "sector" is larger or
    smaller than a line/block and what pipeline stage is "issue" and
    which "dispatch".)

    [snip]

    A lot of this seems a lot more complex though than what would be
    all that practical on a Spartan or Artix class FPGA.

    Modern high performance branch predictors are huge! The branch
    prediction section for AMD's Zen 4 is larger than the
    instruction cache and close to the size of (possibly larger
    than) the instruction cache and decode sections combined: https://substackcdn.com/image/fetch/$s_!Ak84!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d037035-aad5-4ff5-931b-c51f0409e9f3_787x378.jpeg
    (from the Chips & Cheese article "Zen 5's 2-Ahead Branch
    Predictor Unit: How a 30 Year Old Idea Allows for New Tricks" https://chipsandcheese.com/p/zen-5s-2-ahead-branch-predictor-unit-how-30-year-old-idea-allows-for-new-tricks
    )

    I was mostly using 5/6 bit state machines as they gave better
    results than 2-bit saturating counters, and fit nicely within
    the constraints of a "history XOR PC" lookup pattern.

    I think it is very neat that you were experimenting with such.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Feb 25 17:40:11 2026
    From Newsgroup: comp.arch

    On 2/23/2026 4:17 PM, Paul Clayton wrote:
    On 2/18/26 3:45 PM, BGB wrote:

    snip, not feeling up to responding to everything at the moment.

    So, for now, reevaluating things more narrowly.


    I do not know if 5/6-bit state machines have been academically
    examined for predictor entries. I suspect the extra storage is a
    significant discouragement given one often wants to cover more
    different correlations and branches.


    If the 5/6-bit FSM can fit more patterns than 3x 2-bit saturating
    counters, it can be a win.

    I suspect it very much depends on whether bias or pattern is
    dominant. This would depend on the workload (Doom?) and the
    table size (and history length). I do not know that anyone in
    academia has explored this, so I think you should be proud of
    your discovery even if it has limited application.

    A larger table (longer history) can mean longer training, but
    such also discovers more patterns and longer patterns (e.g.,
    predicting a fixed loop count). However, correlation strength
    tends to decrease with increasing distance (having multiple
    history lengths and hashings helps to find the right history).


    Yeah.

    I just decided to go off and do some slightly more extensive testing.


    As noted, the 5/6 bit FSM can predict arbitrary 4 bit patterns.

    When the pattern is exactly repeated this is great, but if the
    correlation with global history is fuzzy (but biased) a counter
    might be better.
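    As a toy illustration of why 4 bits of history suffice for a strictly
    repeating 4-bit pattern (this is a hypothetical direct
    history-to-next-bit learner of my own, with made-up `toy_*` names, not
    the 5/6-bit FSM itself):

```c
/* For a strictly repeating pattern, each 4-bit history window uniquely
   determines the next bit, so a 16-entry last-seen table learns the
   pattern after one full period. */
static int next_bit[16];

int  toy_predict(unsigned hist4)            { return next_bit[hist4 & 15]; }
void toy_train(unsigned hist4, int outcome) { next_bit[hist4 & 15] = outcome; }
```

    The FSM presumably has to encode this kind of history-to-outcome
    mapping (plus lead-in and recovery behavior) in its state graph, while
    fuzzy, merely biased behavior favors a counter's hysteresis.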

    I get the impression that branch prediction is complicated
    enough that even experts only have a gist of what is actually
    happening, i.e., there is a lot of craft and experimentation and
    less logical derivation (I sense).


    I decided to try comparing a few options by sampling branch data and
    then testing against it offline.

    Though, the current empirically-sampled data is limited in scope
    (currently covering running the Doom game loop, and with enough
    sample-points to cover roughly 2 seconds of running time).

    Even for a 50 MHz CPU, sampling more than a few seconds worth of
    branches would eat a significant chunk of RAM and make for a fairly
    large file.


    Recording the startup process, launching Doom, and running part of the
    game loop, while preferable, will take up too much space (if one is
    recording both the PC address and branch direction).

    In some other past contexts, it looked like the 5/6 bit FSM was the
    clear winner. But, with Doom samples and offline testing, not so clear.


    Here, context is the low-order bits of PC XORed with the recent branch history.
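    A minimal sketch of that indexing with the 2-bit saturating counter
    baseline; the table size, function names, and the absence of any PC
    shifting/alignment handling are illustrative assumptions on my part,
    not BGB's actual implementation:

```c
#include <stdint.h>

#define CTX_BITS 6                 /* matches the 6-bit context below */
#define TABLE_SZ (1u << CTX_BITS)

static uint8_t  ctr2[TABLE_SZ];    /* 2-bit saturating counters, 0..3 */
static uint32_t ghist;             /* recent outcomes, LSB = newest */

/* Context: low-order PC bits XORed with the recent branch history. */
static unsigned ctx(uint32_t pc)
{
    return (pc ^ ghist) & (TABLE_SZ - 1u);
}

/* Predict taken when the counter is in its upper half (2 or 3). */
int predict_2bit(uint32_t pc)
{
    return ctr2[ctx(pc)] >= 2;
}

/* Update the counter for the resolved branch, then shift in the outcome. */
void update_2bit(uint32_t pc, int taken)
{
    uint8_t *c = &ctr2[ctx(pc)];
    if (taken) { if (*c < 3) (*c)++; }
    else       { if (*c > 0) (*c)--; }
    ghist = (ghist << 1) | (unsigned)(taken & 1);
}
```

    The 1-bit and 3-bit variants below differ only in the per-entry state;
    the 5/6-bit FSM replaces the counter update with a table lookup.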


    Miss rate from the collected data (with a 6-bit context):
    1-bit, last bit : ~ 19.14%
    2-bit, sat counter : ~ 14.87% (2nd place)
    3-bit, sat counter : ~ 18.24%
    5/6-bit, GA evolved : ~ 14.25% (winner)

    3/4-bit, alternating (hand): ~ 35.95% (bad)
    4/5-bit, 2/3 bit pat (hand): ~ 38.75% (bad)
    5/6 (hand) : Untested (too much effort)
    (Will leave out the hand-written FSMs as they seem to suck).
    ...

    If context is reduced to 5 bits:
    1-bit, last bit : ~ 30.55%
    2-bit, sat counter : ~ 17.37% (2nd place)
    3-bit, sat counter : ~ 21.69%
    5/6-bit FSM : ~ 15.86% (winner)

    If context is reduced to 4 bits:
    1-bit, last bit : ~ 30.55%
    2-bit, sat counter : ~ 22.15% (2nd place)
    3-bit, sat counter : ~ 29.04%
    5/6-bit FSM : ~ 19.25% (winner)


    If context is increased to 8 bits:
    1-bit, last bit : ~ 13.07%
    2-bit, sat counter : ~ 12.32% (winner)
    3-bit, sat counter : ~ 13.96%
    5/6-bit FSM : ~ 12.79% (2nd place)

    If context is increased to 12 bits:
    1-bit, last bit : ~ 9.46% (winner)
    2-bit, sat counter : ~ 9.85% (2nd place)
    3-bit, sat counter : ~ 10.38%
    5/6-bit FSM : ~ 10.70%


    So, it seems increasing the context size causes the FSM to lose its
    advantage (mostly as more of the contexts seem to be dominated by
    single-bit patterns, and accuracy depends more on how quickly the
    predictor can adapt).

    The 3-bit saturating counter here is nearly always worse than the
    2-bit saturating counter.


    If optimizing for "most effectiveness per bit of storage":
    8-bit context + 2-bit sat counter would win;
    If optimizing for FPGA resource cost: 5 and 6 bit contexts would win.
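    For concreteness, the raw storage arithmetic behind that trade-off,
    assuming one direct-mapped entry per context value: the 8-bit-context
    counter table is (1<<8)*2 = 512 bits, while a 6-bit-context FSM table
    is (1<<6)*6 = 384 bits.

```c
/* Storage for a direct-mapped predictor table: one entry per context
   value, so total bits = (1 << context_bits) * bits_per_entry. */
int table_bits(int ctx_bits, int entry_bits)
{
    return (1 << ctx_bits) * entry_bits;
}
```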



    Did go and modify the GA table evolver to also use the sampled data as
    a reference for the initial evolver step (along with the original,
    purely synthetic patterns). This appears to have maybe helped.

    Slightly faster adaptation and noise tolerance; but margins seem small
    (can't rule out the possibility of the difference being due to
    RNG/"butterfly effect" or similar, rather than due to training on "real
    world" data vs purely synthetic patterns).


    With 6-bit context, sampled data vs pure synthetic:
    5/6-bit, GA evolved (real) : ~ 14.25% (winner)
    5/6-bit, GA evolved (synth): ~ 15.33%

    Where, the inclusion of real sampled data in the training process does
    appear to improve accuracy over a version trained purely on synthetic patterns.



    Using a genetic algorithm to generate the tables seems to vastly
    outperform my ability to naively hand-fill them.

    I suspect this is because the GA is able to generate tables which adapt
    to changing patterns more quickly.


    The hand-filled tables tend to need to get back to "hub states" which
    then migrate to the target state for the pattern. The GA-generated
    tables seem able to side-step the need for hub states and adapt to a
    change in pattern within around a few bits (if incrementing the
    bit-pattern, it seems to have a 2-bit lead-in).


    But the need for hub states does put the hand-filled tables at a
    disadvantage relative to the saturating counters; as noted, this does
    not apply to whatever the genetic algorithm is doing.

    So, it seems the hand-written tables are at a serious disadvantage
    here.


    I did verify that it was still able to adapt to arbitrary repeating 3
    and 4 bit patterns (and accurately express these patterns), with the
    main variability being that the GA-evolved tables seem to adapt more
    quickly and are also better able to tolerate "noise".


    But, alas, this doesn't exactly mean that a 5/6 bit FSM is a clear
    winner over a 2-bit saturating counter in the general case.

    Comparing against branch values from initial boot process (6b ctx):
    1-bit, last bit : ~ 11.39%
    2-bit, sat counter : ~ 6.48%
    3-bit, sat counter : ~ 33.86%
    5/6-bit, GA evolved (real) : ~ 4.90%
    5/6-bit, GA evolved (synth): ~ 5.02%


    Granted, this was the context that originally led me to choose the 5/6
    bit FSM as the clear winner over the 2-bit saturating counter.


    And, for reasons, I mostly ended up not using hand-written FSMs in any
    of these cases, as I don't understand whatever black arts the GA is
    using here, nor can I replicate the same level of performance (also,
    even if I have since figured out how to do so, writing out such an FSM
    by hand is still a PITA).

    So, alas, the use of a genetic algorithm seems superior at this task...


    In this case, a C version of the 5/6 bit FSM table looks like:
    static byte fsmtab_5b[64]={
    0x32, 0x20, 0x3E, 0x3B, 0x18, 0x02, 0x31, 0x13,
    0x0C, 0x10, 0x06, 0x16, 0x1E, 0x24, 0x06, 0x12,
    0x0F, 0x0B, 0x01, 0x2B, 0x34, 0x19, 0x1D, 0x2C,
    0x33, 0x37, 0x26, 0x1B, 0x0D, 0x11, 0x08, 0x03,
    0x0D, 0x39, 0x3D, 0x2A, 0x01, 0x13, 0x26, 0x37,
    0x34, 0x36, 0x3F, 0x2D, 0x15, 0x21, 0x0E, 0x2B,
    0x29, 0x2E, 0x38, 0x04, 0x0C, 0x25, 0x00, 0x17,
    0x30, 0x2F, 0x09, 0x03, 0x14, 0x05, 0x3D, 0x23,
    };

    With the input/output data bit in the LSB (other 5 bits being state).
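    Under my reading of that convention (an assumption on my part, not
    something BGB spelled out): the 6-bit table index is
    (state << 1) | actual_outcome, and each entry packs the next 5-bit
    state above a prediction bit in the LSB. A single-context stepper
    might then look like (names mine; "byte" replaced with unsigned char
    so it compiles stand-alone):

```c
/* Same 64-entry table as above: 5 state bits + 1 outcome bit -> 6-bit index. */
static const unsigned char fsmtab_5b[64] = {
    0x32, 0x20, 0x3E, 0x3B, 0x18, 0x02, 0x31, 0x13,
    0x0C, 0x10, 0x06, 0x16, 0x1E, 0x24, 0x06, 0x12,
    0x0F, 0x0B, 0x01, 0x2B, 0x34, 0x19, 0x1D, 0x2C,
    0x33, 0x37, 0x26, 0x1B, 0x0D, 0x11, 0x08, 0x03,
    0x0D, 0x39, 0x3D, 0x2A, 0x01, 0x13, 0x26, 0x37,
    0x34, 0x36, 0x3F, 0x2D, 0x15, 0x21, 0x0E, 0x2B,
    0x29, 0x2E, 0x38, 0x04, 0x0C, 0x25, 0x00, 0x17,
    0x30, 0x2F, 0x09, 0x03, 0x14, 0x05, 0x3D, 0x23,
};

static unsigned fsm_state = 0;   /* current 5-bit state for one context */

/* Feed in the resolved outcome; the looked-up entry yields the predicted
   direction of the next branch (LSB) and the next state (upper 5 bits). */
int fsm_step(int taken)
{
    unsigned v = fsmtab_5b[((fsm_state & 0x1Fu) << 1) | (unsigned)(taken & 1)];
    fsm_state = v >> 1;
    return (int)(v & 1);
}
```

    In hardware this is presumably just a 64-entry, 6-bit-wide LUT per
    table slot, which is where the FPGA-friendliness comes from.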


    ...


    --- Synchronet 3.21b-Linux NewsLink 1.2