On 11/5/25 3:43 PM, MitchAlsup wrote:
[snip]
I am now working on predictors for a 6-wide My 66000 machine--which is a bit
different.
a) VEC-LOOP loops do not alter the branch prediction tables.
b) Predication clauses do not alter the BPTs.
Not recording the history of predicates may have a negative
effect on global history predictors. (I do not know if anyone
has studied this, but it has been mentioned, e.g.,
"[predication] has a negative side-effect because the removal
of branches eliminates useful correlation information
necessary for conventional branch predictors" from "Improving
Branch Prediction and Predicated Execution in Out-of-Order
Processors", Eduardo Quiñones et al., 2007.)
Predicate prediction can also be useful when the availability
of the predicate is delayed. Similarly, selective eager
execution might be worthwhile when the predicate is delayed;
the selection is likely to be predictive (resource use might
be a basis for selection but even estimating that might be
predictive).
On 11/5/25 2:00 AM, BGB wrote:
On 11/4/2025 3:44 PM, Terje Mathisen wrote:[snip]
MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
Branch prediction is fun.
When I looked around online before, a lot of stuff about branch
prediction was talking about fairly large and convoluted schemes for
the branch predictors.
You might be interested in looking at the 6th Championship
Branch Prediction (2025): https://ieeetcca.org/2025/02/18/6th-championship-branch-prediction-cbp2025/
TAgged GEometric length predictors (TAGE) seem to be the current
"hotness" for branch predictors. These record very long global
histories and fold them into shorter indexes with the number of
history bits used varying for different tables.
(Because the correlation is less strong, 3-bit counters are
generally used as well as a useful bit.)
But, then always at the end of it using 2-bit saturating counters:
  weakly taken, weakly not-taken, strongly taken, strongly not-taken.
But, in my fiddling, there was seemingly a simple but moderately
effective strategy:
  Keep a local history of taken/not-taken;
  XOR this with the low-order-bits of PC for the table index;
  Use a 5/6-bit finite-state-machine or similar.
    Can model repeating patterns up to ~ 4 bits.
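The list above can be sketched as a minimal C fragment. Table size, history width, and 4-byte branch alignment are illustrative assumptions, and the 5/6-bit FSM is abbreviated here to a 2-bit saturating counter:

```c
#include <stdint.h>

/* Sketch of the scheme above: recent taken/not-taken history is
   XORed with low-order PC bits to index a small prediction table.
   The 64-entry size and 8-bit history are illustrative assumptions. */
#define PRED_BITS 6
#define PRED_SIZE (1u << PRED_BITS)

static uint8_t pred_table[PRED_SIZE]; /* per-entry predictor state */
static uint8_t history;               /* recent taken/not-taken bits */

static unsigned pred_index(uint32_t pc)
{
    /* branches assumed 4-byte aligned, so drop the low 2 PC bits */
    return ((pc >> 2) ^ history) & (PRED_SIZE - 1);
}

static int pred_lookup(uint32_t pc)
{
    return pred_table[pred_index(pc)] >= 2; /* high counter -> taken */
}

static void pred_update(uint32_t pc, int taken)
{
    unsigned idx = pred_index(pc);
    uint8_t s = pred_table[idx];
    /* the 5/6-bit FSM step is abbreviated to a 2-bit counter here */
    if (taken) { if (s < 3) s++; } else { if (s > 0) s--; }
    pred_table[idx] = s;
    history = (uint8_t)((history << 1) | (taken & 1));
}
```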
Indexing a predictor by _local_ (i.e., per instruction address)
history adds a level of indirection; once one has the branch
(fetch) address one needs to index the local history and then
use that to index the predictor. The Alpha 21264 had a modest-
sized (by modern standards) local history predictor with a 1024-
entry table of ten history bits indexed by the Program Counter
and the ten bits were used to index a table of three-bit
counters. This was combined with a 12-bit global history
predictor with 2-bit counters (note: not gshare, i.e., xored
with the instruction address) and the same index was used for
the chooser.
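The 21264-style local component described above can be sketched as follows; the table sizes follow the text, but the code itself is only an illustrative assumption:

```c
#include <stdint.h>

/* Two-level local predictor in the style described above: 1024
   ten-bit per-PC histories, each history indexing a table of 3-bit
   saturating counters. */
static uint16_t local_hist[1024];       /* 10 bits of per-PC history */
static uint8_t  local_ctr[1024];        /* 3-bit counters, 2^10 entries */

static int local_predict(uint32_t pc)
{
    uint16_t h = local_hist[(pc >> 2) & 1023];
    return local_ctr[h] >= 4;            /* MSB of the 3-bit counter */
}

static void local_update(uint32_t pc, int taken)
{
    unsigned i = (pc >> 2) & 1023;
    uint16_t h = local_hist[i];
    uint8_t  c = local_ctr[h];
    if (taken) { if (c < 7) c++; } else { if (c > 0) c--; }
    local_ctr[h] = c;
    local_hist[i] = (uint16_t)(((h << 1) | (taken & 1)) & 1023);
}
```

Note the two dependent lookups: PC selects the history, and only then can the counter table be indexed, which is the extra latency level mentioned above.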
I do not know if 5/6-bit state machines have been academically
examined for predictor entries. I suspect the extra storage is a
significant discouragement given one often wants to cover more
different correlations and branches.
TAGE has the advantage that the tags reduce branch aliases and
the variable history length (with history folding/compression)
allows using less storage (when a prediction only benefits from
a shorter history) and reduces training time.
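The folding of a long history into a short index can be sketched like this. This is an illustrative assumption, not the exact TAGE circuit; real implementations keep a circular shift register so the folded value updates in O(1) per branch, but the result is equivalent:

```c
#include <stdint.h>

/* Fold a long global history into an idx_bits-wide index by XORing
   consecutive idx_bits-wide segments of the history together. */
static unsigned fold_history(const uint8_t *hist, int hist_len, int idx_bits)
{
    unsigned folded = 0, acc = 0;
    int pos = 0;
    for (int i = 0; i < hist_len; i++) {
        acc |= (unsigned)(hist[i] & 1) << pos;
        if (++pos == idx_bits) { folded ^= acc; acc = 0; pos = 0; }
    }
    return (folded ^ acc) & ((1u << idx_bits) - 1);
}
```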
(I am a little surprised that I have not read a suggestion to
use the alternate 2-bit encoding that retains the last branch
direction. This history might be useful for global history; the
next most recent direction (i.e., not the predicted direction)
of previous recent branches for a given global history might be
useful in indexing a global history predictor. This 2-bit
encoding seems to give slightly worse predictions than a
saturating counter but the benefit of "localized" global history
might compensate for this.)
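One plausible reading of that alternate 2-bit encoding (an assumption on my part, not a design spelled out in the text): the high bit holds the prediction, the low bit records the last actual direction, and the prediction flips only after two consecutive contradicting outcomes, so the most recent direction always survives in the low bit:

```c
#include <stdint.h>

/* Hypothetical 2-bit encoding retaining the last branch direction:
   bit1 = prediction, bit0 = last actual outcome. */
static uint8_t alt2_step(uint8_t s, int taken)
{
    int pred = (s >> 1) & 1, last = s & 1;
    if (taken != pred && last != pred)
        pred = taken;                 /* two misses in a row: flip */
    return (uint8_t)((pred << 1) | (taken & 1));
}

static int alt2_predict(uint8_t s) { return (s >> 1) & 1; }
static int alt2_last(uint8_t s)    { return s & 1; }       /* exposed history bit */
```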
Alloyed prediction ("Alloyed Branch History: Combining Global
and Local Branch History for Robust Performance", Zhijian Lu et
al., 2002) used a tiny amount of local history to index a
(mostly) global history predictor, hiding (much of) the latency
of looking up the local history by retrieving multiple entries
from the table and selecting the appropriate one with the local
history.
There was also a proposal ("Branch Transition Rate: A New Metric
for Improved Branch Classification Analysis", Michael Haungs et
al., 2000) to consider transition rate, noting that high
transition rate branches (which flip direction frequently) are
poorly predicted by averaging behavior. (Obviously, loop-like
branches have a high transition rate for one direction.) This is
a limited type of local history. If I understand correctly, your
state machine mechanism would capture the behavior of such
highly alternating branches.
Compressing the history into a pattern does mean losing
information (as does a counter), but I had thought such pattern
storage might be more efficient than storing local history. It
is also interesting that the Alpha 21264 local predictor used
dynamic pattern matching rather than a static transformation of
history to prediction (state machine).
I think longer local history prediction has become unpopular,
probably because nothing like TAGE was proposed to support
longer histories but also because the number of branches that
can be tracked with long histories is smaller.
Local history patterns may also be less common than statistical
correlation after one has extracted branches predicted well by
global history. (For small-bodied loops, a moderately long
global history provides substantial local history.)
Using a pattern/state machine may also make confidence
estimation less accurate. TAGE can use confidence of multiple
matches to form a prediction.
The use of confidence for making a prediction also makes it
impractical to store just a prediction nearby (to reduce
latency) and have the extra state more physically distant. For
per-address predictors, one could in theory use Icache
replacement to constrain predictor size, where an Icache miss
loads local predictions from an L2 local predictor. I think AMD
had limited local predictors associated with the Icache and had
previously stored some prediction information in L2 cache using
the fact that code is not modified (so parity could be used and
not ECC).
Daniel A. Jiménez did some research on using neural methods
(e.g., perceptrons) for branch prediction. The principle was
that traditional global history tables had exponential scaling
with history size (2 to the N table entries for N history bits)
while per-address perceptrons would scale linearly (for a fixed
number of branch addresses). TAGE (with its folding of long
histories and variable history) seems to have removed this as a
distinct benefit. General correlation with specific branches may
also be less predictive than correlation with path.
Nevertheless, the research was interesting and larger histories
did provide more accurate predictions.
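A minimal perceptron predictor in the spirit of that research might look like the following; the history length, threshold, and weight clamping are illustrative assumptions:

```c
#include <stdint.h>

#define HLEN  16
#define THETA 30                 /* training threshold, an assumed value */

static int8_t weights[HLEN + 1]; /* weights[0] is the bias */
static int    ghist[HLEN];       /* global history as +1/-1 (0 until warm) */

static int perceptron_output(void)
{
    int y = weights[0];
    for (int i = 0; i < HLEN; i++)
        y += weights[i + 1] * ghist[i];
    return y;
}

static int perceptron_predict(void) { return perceptron_output() >= 0; }

static void perceptron_train(int taken)
{
    int t = taken ? 1 : -1;
    int y = perceptron_output();
    /* train on a misprediction or a weak (below-threshold) output */
    if ((y >= 0) != taken || (y < THETA && y > -THETA)) {
        if (weights[0] > -127 && weights[0] < 127) weights[0] += t;
        for (int i = 0; i < HLEN; i++)
            if (weights[i + 1] > -127 && weights[i + 1] < 127)
                weights[i + 1] = (int8_t)(weights[i + 1] + t * ghist[i]);
    }
    for (int i = HLEN - 1; i > 0; i--) ghist[i] = ghist[i - 1];
    ghist[0] = t;
}
```

The storage scales linearly with history length (one weight per bit), which was the selling point over the exponential table growth mentioned above.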
Anyway, I agree that thinking about these things can be fun.
Where, the idea was that the state-machine is updated with the current
state and branch direction, giving the next state and next predicted
branch direction (for this state).
Could model slightly more complex patterns than the 2-bit saturating
counters, but it is sort of a partial mystery why (for mainstream
processors) more complex lookup schemes with 2-bit state were
preferable to a simpler lookup scheme with 5-bit state.
Well, apart from the relative "dark arts" needed to cram 4-bit
patterns into a 5-bit FSM (it is a bit easier if limiting the patterns to
3 bits).
Then again, had before noted that the LLMs are seemingly also not
really able to figure out how to make a 5 bit FSM to model a full set
of 4 bit patterns.
Then again, I wouldn't expect it to be all that difficult of a problem
for someone that is "actually smart"; so presumably chip designers
could have done similar.
Well, unless maybe the argument is that 5 or 6 bits of storage would
cost more than 2 bits, but then presumably needing to have
significantly larger tables (to compensate for the relative predictive
weakness of 2-bit state) would have cost more than the cost of
smaller tables of 6 bit state ?...
Say, for example, 2b:
  00_0 => 10_0  //Weakly not-taken, dir=0, goes strongly not-taken
  00_1 => 01_1  //Weakly not-taken, dir=1, goes weakly taken
  01_0 => 00_0  //Weakly taken, dir=0, goes weakly not-taken
  01_1 => 11_1  //Weakly taken, dir=1, goes strongly taken
  10_0 => 10_0  //Strongly not-taken, dir=0
  10_1 => 00_0  //Strongly not-taken, dir=1 (goes weak)
  11_0 => 01_1  //Strongly taken, dir=0 (goes weak)
  11_1 => 11_1  //Strongly taken, dir=1
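A compact encoding of a 2-bit saturating counter like the table above, reading the low bit of the state as the predicted direction and the high bit as strength:

```c
#include <stdint.h>

/* next2[state][outcome]: 00 weak NT, 01 weak T, 10 strong NT, 11 strong T */
static const uint8_t next2[4][2] = {
    /* 00 weak NT   */ { 2, 1 },  /* 0 -> strong NT, 1 -> weak T   */
    /* 01 weak T    */ { 0, 3 },  /* 0 -> weak NT,   1 -> strong T */
    /* 10 strong NT */ { 2, 0 },  /* 0 -> stays,     1 -> weak NT  */
    /* 11 strong T  */ { 1, 3 },  /* 0 -> weak T,    1 -> stays    */
};

static int     predict2(uint8_t s)            { return s & 1; }
static uint8_t step2(uint8_t s, int taken)    { return next2[s & 3][taken & 1]; }
```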
Can expand it to 3 bits, for 2-bit patterns:
  As above, and 4 more alternating states
  And slightly different transition logic.
Say (abbreviated):
  000   weak, not-taken
  001   weak, taken
  010   strong, not-taken
  011   strong, taken
  100   weak, alternating, not-taken
  101   weak, alternating, taken
  110   strong, alternating, not-taken
  111   strong, alternating, taken
The alternating states just flip-flopping between taken and not taken.
  The weak states can move between any of the 4.
  The strong states are used if the pattern is reinforced.
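One plausible transition function for these eight states (an assumption; the post does not spell out the transitions). The low bit is the predicted direction, the middle bit is strength, and the top bit marks the alternating states, which flip the predicted direction after each hit:

```c
#include <stdint.h>

/* state = (alt << 2) | (strong << 1) | dir; prediction = dir */
static uint8_t step3(uint8_t s, int taken)
{
    int dir = s & 1, strong = (s >> 1) & 1, alt = (s >> 2) & 1;
    int hit = (taken == dir);
    if (!alt) {
        if (hit)    return (uint8_t)(2 | dir);         /* reinforce: strong, same dir */
        if (strong) return (uint8_t)(dir);             /* strong miss -> weak, same dir */
        return (uint8_t)(4 | (taken ^ 1));             /* weak miss: try alternating */
    }
    if (hit)    return (uint8_t)(6 | (taken ^ 1));     /* strengthen, flip for next */
    if (strong) return (uint8_t)(4 | (taken ^ 1));     /* weaken, keep alternating */
    return (uint8_t)(taken);                           /* give up: weak, follow outcome */
}

static int predict3(uint8_t s) { return s & 1; }
```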
Going up to 3 bit patterns is more of the same (add another bit,
doubling the number of states). Something seemingly gets nasty when
getting to 4 bit patterns though (and can't fit both weak and strong
states for longer patterns, so the 4b patterns effectively only exist
as weak states which partly overlap with the weak states for the 3-bit
patterns).
But, yeah, not going to type out state tables for these ones.
Not proven, but I suspect that an arbitrary 5 bit pattern within a 6
bit state might be impossible. Although there would be sufficient
state-space for the looping 5-bit patterns, there may not be
sufficient state-space to distinguish whether to move from a
mismatched 4-bit pattern to a 3 or 5 bit pattern. Whereas, at least
with 4-bit, any mismatch of the 4-bit pattern can always decay to a 3-
bit pattern, etc. One needs to be able to express decay both to
shorter patterns and to longer patterns, and I suspect at this point,
the pattern breaks down (but can't easily confirm; it is either this
or the pattern extends indefinitely, I don't know...).
Could almost have this sort of thing as a "brain teaser" puzzle or
something...
Then again, maybe other people would not find any particular
difficulty in these sorts of tasks.
Terje
On 2/16/2026 3:14 PM, Paul Clayton wrote:
On 11/5/25 2:00 AM, BGB wrote:
On 11/4/2025 3:44 PM, Terje Mathisen wrote:[snip]
MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
Branch prediction is fun.
When I looked around online before, a lot of stuff about
branch prediction was talking about fairly large and
convoluted schemes for the branch predictors.
You might be interested in looking at the 6th Championship
Branch Prediction (2025): https://ieeetcca.org/2025/02/18/6th-championship-branch-prediction-cbp2025/
Quick look, didn't see much information about who entered or won...
TAgged GEometric length predictors (TAGE) seem to be the current
"hotness" for branch predictors. These record very long global
histories and fold them into shorter indexes with the number of
history bits used varying for different tables.
(Because the correlation is less strong, 3-bit counters are
generally used as well as a useful bit.)
When I messed with it, increasing the strength of the saturating
counters was not effective.
But, increasing the ability of them to predict more complex
patterns did help.
[snip] But, then always at the end of it using 2-bit saturating
counters:
  weakly taken, weakly not-taken, strongly taken, strongly
  not-taken.
But, in my fiddling, there was seemingly a simple but
moderately effective strategy:
  Keep a local history of taken/not-taken;
  XOR this with the low-order-bits of PC for the table index;
  Use a 5/6-bit finite-state-machine or similar.
    Can model repeating patterns up to ~ 4 bits.
Indexing a predictor by _local_ (i.e., per instruction address)
history adds a level of indirection; once one has the branch
(fetch) address one needs to index the local history and then
use that to index the predictor.
OK, seems I wrote it wrong:
I was using a global branch history.
But, either way, the history is XOR'ed with the relevant bits
from PC to generate the index.
I do not know if 5/6-bit state machines have been academically
examined for predictor entries. I suspect the extra storage is a
significant discouragement given one often wants to cover more
different correlations and branches.
If the 5/6-bit FSM can fit more patterns than 3x 2-bit
saturating counters, it can be a win.
As noted, the 5/6 bit FSM can predict arbitrary 4 bit patterns.
With the PC XOR Hist lookup, there were still quite a few
patterns, and not just the contiguous 0s or 1s that the saturating
counter would predict, nor just all 0s, all 1s, or ...101010...
that the 3-bit FSM could deal with.
But, a bigger history could mean fewer patterns and more
contiguous bits in the state.
TAGE has the advantage that the tags reduce branch aliases and
the variable history length (with history folding/compression)
allows using less storage (when a prediction only benefits from
a shorter history) and reduces training time.
In my case, was using 6-bit lookup mostly to fit into LUT6 based
LUTRAM.
Going bigger than 6 bits here is a pain point for FPGAs, more so
as BRAMs don't support narrow lookups, so the next size up
would likely be 2048x but then inefficiently using the BRAM
(there isn't likely a particularly good way to make use of 512x
18-bits).
Local history patterns may also be less common than statistical
correlation after one has extracted branches predicted well by
global history. (For small-bodied loops, a moderately long
global history provides substantial local history.)
It seems what I wrote originally was inaccurate: I don't store a
history per target, merely a history of recent taken/not-taken branches.
But, I no longer remember what I was thinking at the time, or
why I had written local history rather than global history
(unless I meant "local" in terms of recency or something, I
don't know).
A lot of this seems a lot more complex though than what would be
all that practical on a Spartan or Artix class FPGA.
I was mostly using 5/6 bit state machines as they gave better
results than 2-bit saturating counters, and fit nicely within
the constraints of a "history XOR PC" lookup pattern.
On 2/18/26 3:45 PM, BGB wrote:
I do not know if 5/6-bit state machines have been academically
examined for predictor entries. I suspect the extra storage is a
significant discouragement given one often wants to cover more
different correlations and branches.
If the 5/6-bit FSM can fit more patterns than 3x 2-bit saturating
counters, it can be a win.
I suspect it very much depends on whether bias or pattern is
dominant. This would depend on the workload (Doom?) and the
table size (and history length). I do not know that anyone in
academia has explored this, so I think you should be proud of
your discovery even if it has limited application.
A larger table (longer history) can mean longer training, but
such also discovers more patterns and longer patterns (e.g.,
predicting a fixed loop count). However, correlation strength
tends to decrease with increasing distance (having multiple
history lengths and hashings helps to find the right history).
As noted, the 5/6 bit FSM can predict arbitrary 4 bit patterns.
When the pattern is exactly repeated this is great, but if the
correlation with global history is fuzzy (but biased) a counter
might be better.
I get the impression that branch prediction is complicated
enough that even experts only have a gist of what is actually
happening, i.e., there is a lot of craft and experimentation and
less logical derivation (I sense).