• Re: branch splitting

    From Paul Clayton@paaronclayton@gmail.com to comp.arch on Sun Feb 8 10:24:54 2026
    From Newsgroup: comp.arch

    On 11/5/25 3:43 PM, MitchAlsup wrote:
    [snip]
    I am now working on predictors for a 6-wide My 66000 machine--which is a bit different.
    a) VEC-LOOP loops do not alter the branch prediction tables.
    b) Predication clauses do not alter the BPTs.

    Not recording the history of predicates may have a negative
    effect on global history predictors. (I do not know if anyone
    has studied this, but it has been mentioned -- e.g.,
    "[predication] has a negative side-effect because the removal
    of branches eliminates useful correlation information
    necessary for conventional branch predictors" from "Improving
    Branch Prediction and Predicated Execution in Out-of-Order
    Processors", Eduardo Quiñones et al., 2007.)

    Predicate prediction can also be useful when the availability
    of the predicate is delayed. Similarly, selective eager
    execution might be worthwhile when the predicate is delayed;
    the selection is likely to be predictive (resource use might
    be a basis for selection but even estimating that might be
    predictive).

    With predication of all short forward branches in order to
    avoid fetch bubbles, the impact of delayed predicate
    availability and missing information for branch prediction
    may be greater than for more selective predication.

    There may also be some short forward branches that are 99%
    taken such that converting to a longer branch with a jump back
    may be a better option. With trace-cache-like optimization,
    such branched over code could be removed from fetch even when
    the compiler used a short branch. Dynamic code organization
    has the advantage of being able to use dynamically available
    information (and the disadvantage of gathering the information
    and making a decision dynamically).

    Something like a branch target cache could store extracted
    instructions. This might facilitate stitching such
    instructions back into the instruction stream with limited
    overhead. Since this would only work for usually taken
    hammock branches, it would probably not be worthwhile. For
    if-then-else constructs, one might place both paths in
    separate entries in such a target cache and always stitch
    in one of them, but that seems wonky.

    I rather doubt the benefits of such would justify the added
    complexity -- almost certainly not in a first or second
    implementation of an architecture -- but I would not want to
    reject future possible adoption of such techniques.

    My guess would be that most short forward branches would not
    use an extracted code cache either because they are generally
    not taken so there is no fetch advantage or because the branch
    direction is unpredictable such that predication likely makes
    more sense and fetching from two structures just adds
    complexity.

    For highly unlikely code, an extracted cache might have
    higher latency (from placement, from deferred access to use
    better prediction, or from more complex retrieval). Stalling
    renaming when a longer-latency insertion is predicted seems
    undesirable (though it may have negligible performance harm),
    but including just enough dataflow information in the quicker
    accessed caches to support out-of-order fetch seems
    complicated.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Feb 9 19:20:01 2026
    From Newsgroup: comp.arch


    Paul Clayton <paaronclayton@gmail.com> posted:

    On 11/5/25 3:43 PM, MitchAlsup wrote:
    [snip]
    I am now working on predictors for a 6-wide My 66000 machine--which is a bit
    different.
    a) VEC-LOOP loops do not alter the branch prediction tables.
    b) Predication clauses do not alter the BPTs.

    Not recording the history of predicates may have a negative
    effect on global history predictors. (I do not know if anyone
    has studied this, but it has been mentioned -- e.g.,
    "[predication] has a negative side-effect because the removal
    of branches eliminates useful correlation information
    necessary for conventional branch predictors" from "Improving
    Branch Prediction and Predicated Execution in Out-of-Order
    Processors", Eduardo Quiñones et al., 2007.)

    It depends on where you are looking! If you think branch prediction
    alters where FETCH is fetching, then My 66000 predication does not
    do predication prediction--predication is used when the join point
    will have already been fetched by the time the condition is known.
    Then, either the then clause or the else clause will be nullified
    without backup (i.e., branch prediction repair).

    DECODE is still able to predict then-clause versus else-clause
    and maintain the no-backup property, as long as both sides are
    issued into the execution window.

    Predicate prediction can also be useful when the availability
    of the predicate is delayed. Similarly, selective eager
    execution might be worthwhile when the predicate is delayed;
    the selection is likely to be predictive (resource use might
    be a basis for selection but even estimating that might be
    predictive).

    The difference is that predication prediction never needs branch
    prediction repair.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Feb 18 14:45:36 2026
    From Newsgroup: comp.arch

    On 2/16/2026 3:14 PM, Paul Clayton wrote:
    On 11/5/25 2:00 AM, BGB wrote:
    On 11/4/2025 3:44 PM, Terje Mathisen wrote:
    MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    [snip]
    Branch prediction is fun.

    When I looked around online before, a lot of stuff about branch
    prediction was talking about fairly large and convoluted schemes for
    the branch predictors.

    You might be interested in looking at the 6th Championship
    Branch Prediction (2025): https://ieeetcca.org/2025/02/18/6th-championship-branch-prediction-cbp2025/


    Quick look, didn't see much information about who entered or won...


    TAgged GEometric length predictors (TAGE) seem to be the current
    "hotness" for branch predictors. These record very long global
    histories and fold them into shorter indexes with the number of
    history bits used varying for different tables.

    (Because the correlation is less strong, 3-bit counters are
    generally used, along with a "useful" bit.)


    When I messed with it, increasing the strength of the saturating
    counters was not effective.

    But, increasing the ability of them to predict more complex patterns did
    help.



    But, then always at the end of it using 2-bit saturating counters:
      weakly taken, weakly not-taken, strongly taken, strongly not taken.

    But, in my fiddling, there was seemingly a simple but moderately
    effective strategy:
      Keep a local history of taken/not-taken;
      XOR this with the low-order-bits of PC for the table index;
      Use a 5/6-bit finite-state-machine or similar.
        Can model repeating patterns up to ~ 4 bits.
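    Sketched in Python, with plain 2-bit counters standing in for the
    5/6-bit FSM entries (whose tables are not written out here); the
    class name, table size, and PC shift are made up for illustration:

```python
TABLE_BITS = 6  # 64 entries, matching a LUT6-sized lookup

class XorPredictor:
    """History-XOR-PC indexed table; entries are 2-bit counters (0..3)."""

    def __init__(self):
        self.table = [1] * (1 << TABLE_BITS)  # start weakly not-taken
        self.hist = 0                         # recent taken/not-taken bits

    def _index(self, pc):
        mask = (1 << TABLE_BITS) - 1
        return ((pc >> 2) ^ self.hist) & mask

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2  # counter MSB = taken

    def update(self, pc, taken):
        i = self._index(pc)
        c = self.table[i]
        self.table[i] = min(3, c + 1) if taken else max(0, c - 1)
        self.hist = ((self.hist << 1) | (1 if taken else 0)) & ((1 << TABLE_BITS) - 1)
```

    Because the history participates in the index, a strictly
    alternating branch trains two separate counters (one per history
    value) and becomes perfectly predictable, which a counter alone
    cannot do.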

    Indexing a predictor by _local_ (i.e., per instruction address)
    history adds a level of indirection; once one has the branch
    (fetch) address one needs to index the local history and then
    use that to index the predictor. The Alpha 21264 had a modest-
    sized (by modern standards) local history predictor with a 1024-
    entry table of ten history bits indexed by the Program Counter
    and the ten bits were used to index a table of three-bit
    counters. This was combined with a 12-bit global history
    predictor with 2-bit counters (note: not gshare, i.e., the
    history was not XORed with the instruction address) and the
    same index was used for the chooser.
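    The two-level local lookup described above can be sketched as
    follows (table sizes per the 21264 description; the class name and
    PC hashing are arbitrary):

```python
class LocalPredictor:
    """21264-style local predictor: PC indexes a 1024-entry table of
    10-bit local histories; the history then indexes a 1024-entry
    table of 3-bit saturating counters (0..7)."""

    def __init__(self):
        self.histories = [0] * 1024  # 10 bits of per-address history
        self.counters = [3] * 1024   # 3-bit counters, >= 4 means taken

    def predict(self, pc):
        h = self.histories[(pc >> 2) & 1023]
        return self.counters[h] >= 4

    def update(self, pc, taken):
        hi = (pc >> 2) & 1023
        h = self.histories[hi]
        c = self.counters[h]
        self.counters[h] = min(7, c + 1) if taken else max(0, c - 1)
        self.histories[hi] = ((h << 1) | (1 if taken else 0)) & 1023
```

    Note the serialized indirection: the counter table cannot be read
    until the history table read completes, which is the latency cost
    mentioned above.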


    OK, seems I wrote it wrong:
    I was using a global branch history.

    But, either way, the history is XOR'ed with the relevant bits from PC to generate the index.


    I do not know if 5/6-bit state machines have been academically
    examined for predictor entries. I suspect the extra storage is a
    significant discouragement given one often wants to cover more
    different correlations and branches.


    If the 5/6-bit FSM can fit more patterns than 3x 2-bit saturating
    counters, it can be a win.


    As noted, the 5/6 bit FSM can predict arbitrary 4 bit patterns.

    With the PC XOR Hist lookup, there were still quite a few patterns, and
    not just contiguous 0s or 1s that the saturating counter would predict,
    nor just all 0s, all 1s, or ...101010... that the 3-bit FSM could deal with.


    But, a bigger history could mean fewer patterns and more contiguous bits
    in the state.



    TAGE has the advantage that the tags reduce branch aliases and
    the variable history length (with history folding/compression)
    allows using less storage (when a prediction only benefits from
    a shorter history) and reduces training time.
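    The history folding can be sketched as a from-scratch XOR fold
    (real TAGE maintains the fold incrementally with a circular shift
    so it is O(1) per branch; the function names here are made up):

```python
def fold_history(hist_bits, hist_len, index_bits):
    """XOR-fold the low `hist_len` bits of a long global history
    down to `index_bits` bits, chunk by chunk."""
    folded = 0
    h = hist_bits & ((1 << hist_len) - 1)
    while h:
        folded ^= h & ((1 << index_bits) - 1)
        h >>= index_bits
    return folded

def tage_index(pc, hist, hist_len, index_bits=10):
    """Index for one TAGE table: folded history hashed with PC bits.
    Each table uses a different (geometrically growing) hist_len."""
    mask = (1 << index_bits) - 1
    return (fold_history(hist, hist_len, index_bits) ^ (pc >> 2)) & mask
```

    Different tables call this with different `hist_len` values, which
    is what lets short-history branches avoid paying for long-history
    storage.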


    In my case, was using 6-bit lookup mostly to fit into LUT6 based LUTRAM.

    Going bigger than 6 bits here is a pain point for FPGAs, more so as
    BRAM's don't support narrow lookups, so the next size up would likely be
    2048x but then inefficiently using the BRAM (there isn't likely a
    particularly good way to make use of 512x 18-bits).


    Though, more efficient case here would likely be to use LUT5's (5-bit
    index), or maybe MUXing LUT5's (for 128x 6b, maybe 512x 6b with 3-levels
    of LUTs).


    I guess, it is an open question if one did:
    reg[4:0] predArray[511:0];
    If the synthesis would figure out the most optimal pattern on its own,
    vs needing to get more manual, with an array more like:
    reg[95:0] predArray[31:0]; //fit to 5b index / 3b data.

    With it padded up to 6-bits per entry to allow fitting it to the LUTRAM
    1R1W pattern (densely packing to 80 bits would not fit the pattern).


    But, yeah, if the size of the lookup were extended, it could make sense
    to start hashing part of the history, say, maybe:

    hhist[ 5: 0] = hist[5:0];
    hhist[ 9: 6] = hist[9:6] ^ hist[13:10];
    hhist[11:10] = hist[11:10] ^ hist[13:12] ^ hist[15:14];

    Then, say:
    index[11:0] = hhist[11:0] ^ { pc[12:2], pc[13] ^ pc[1] };

    ...
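    Transcribed into Python so the widths can be sanity-checked (slice
    bounds follow the Verilog-style [hi:lo] ranges above; `bits` and
    `hashed_index` are made-up helper names):

```python
def bits(x, hi, lo):
    """Extract the Verilog-style bit slice x[hi:lo]."""
    return (x >> lo) & ((1 << (hi - lo + 1)) - 1)

def hashed_index(pc, hist):
    """12-bit table index from a partially-folded history XOR PC bits,
    per the sketch above."""
    hhist = bits(hist, 5, 0)
    hhist |= (bits(hist, 9, 6) ^ bits(hist, 13, 10)) << 6
    hhist |= (bits(hist, 11, 10) ^ bits(hist, 13, 12) ^ bits(hist, 15, 14)) << 10
    # { pc[12:2], pc[13] ^ pc[1] }: pc[12:2] in the upper 11 bits,
    # the XORed bit in bit 0
    pcmix = (bits(pc, 12, 2) << 1) | (bits(pc, 13, 13) ^ bits(pc, 1, 1))
    return (hhist ^ pcmix) & 0xFFF
```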


    (I am a little surprised that I have not read a suggestion to
    use the alternate 2-bit encoding that retains the last branch
    direction. This history might be useful for global history; the
    next most recent direction (i.e., not the predicted direction)
    of previous recent branches for a given global history might be
    useful in indexing a global history predictor. This 2-bit
    encoding seems to give slightly worse predictions than a
    saturating counter but the benefit of "localized" global history
    might compensate for this.)

    Alloyed prediction ("Alloyed Branch History: Combining Global
    and Local Branch History for Robust Performance", Zhijian Lu et
    al., 2002) used a tiny amount of local history to index a
    (mostly) global history predictor, hiding (much of) the latency
    of looking up the local history by retrieving multiple entries
    from the table and selecting the appropriate one with the local
    history.

    There was also a proposal ("Branch Transition Rate: A New Metric
    for Improved Branch Classification Analysis", Michael Haungs et
    al., 2000) to consider transition rate, noting that high
    transition rate branches (which flip direction frequently) are
    poorly predicted by averaging behavior. (Obviously, loop-like
    branches have a high transition rate for one direction.) This is
    a limited type of local history. If I understand correctly, your
    state machine mechanism would capture the behavior of such
    highly alternating branches.

    Compressing the history into a pattern does mean losing
    information (as does a counter), but I had thought such pattern
    storage might be more efficient than storing local history. It
    is also interesting that the Alpha 21264 local predictor used
    dynamic pattern matching rather than a static transformation of
    history to prediction (state machine).

    I think longer local history prediction has become unpopular,
    probably because nothing like TAGE was proposed to support
    longer histories but also because the number of branches that
    can be tracked with long histories is smaller.

    Local history patterns may also be less common than statistical
    correlation after one has extracted branches predicted well by
    global history. (For small-bodied loops, a moderately long
    global history provides substantial local history.)


    ...

    It seems what I wrote originally was inaccurate: I don't store a history per-target; it was merely recent taken/not-taken branches.

    But, I no longer remember what I was thinking at the time, or why I had written local history rather than global history (unless I meant "local"
    in terms of recency or something, I don't know).



    Using a pattern/state machine may also make confidence
    estimation less accurate. TAGE can use confidence of multiple
    matches to form a prediction.

    The use of confidence for making a prediction also makes it
    impractical to store just a prediction nearby (to reduce
    latency) and have the extra state more physically distant. For
    per-address predictors, one could in theory use Icache
    replacement to constrain predictor size, where an Icache miss
    loads local predictions from an L2 local predictor. I think AMD
    had limited local predictors associated with the Icache and had
    previously stored some prediction information in L2 cache using
    the fact that code is not modified (so parity could be used and
    not ECC).

    Daniel A. Jiménez did some research on using neural methods
    (e.g., perceptrons) for branch prediction. The principle was
    that traditional global history tables had exponential scaling
    with history size (2 to the N table entries for N history bits)
    while per-address perceptrons would scale linearly (for a fixed
    number of branch addresses). TAGE (with its folding of long
    histories and variable history) seems to have removed this as a
    distinct benefit. General correlation with specific branches may
    also be less predictive than correlation with path.
    Nevertheless, the research was interesting and larger histories
    did provide more accurate predictions.
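    A minimal perceptron-predictor sketch along those lines (table
    size, history length, and the class name are arbitrary; the
    training threshold formula is the one commonly quoted from
    Jiménez's work):

```python
HLEN = 16                       # global history length
THETA = int(1.93 * HLEN + 14)   # training threshold

class PerceptronPredictor:
    """Per-address weight vectors over global history bits; storage
    grows linearly with HLEN rather than as 2**HLEN."""

    def __init__(self, n_entries=64):
        self.weights = [[0] * (HLEN + 1) for _ in range(n_entries)]  # +1 bias
        self.hist = [1] * HLEN  # history as +/-1 values
        self.n = n_entries

    def _output(self, pc):
        w = self.weights[(pc >> 2) % self.n]
        return w[0] + sum(wi * hi for wi, hi in zip(w[1:], self.hist))

    def predict(self, pc):
        return self._output(pc) >= 0

    def update(self, pc, taken):
        y = self._output(pc)
        t = 1 if taken else -1
        w = self.weights[(pc >> 2) % self.n]
        # Train on a misprediction or while the output is not yet confident.
        if (y >= 0) != taken or abs(y) <= THETA:
            w[0] += t
            for i in range(HLEN):
                w[i + 1] += t * self.hist[i]
        self.hist = [t] + self.hist[:-1]
```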

    Anyway, I agree that thinking about these things can be fun.


    A lot of this seems a lot more complex though than what would be all
    that practical on a Spartan or Artix class FPGA.


    I was mostly using 5/6 bit state machines as they gave better results
    than 2-bit saturating counters, and fit nicely within the constraints of
    a "history XOR PC" lookup pattern.


    Also initially, a while ago, it seemed that the patterns for these
    state machines were outside the reach of LLM AIs, but it seems the
    AIs are quickly catching up at these sorts of mental puzzles.

    Granted, I am mostly within the limits of cheap/free.


    But, OTOH, paying a steep per-month fee so that "vibe coding" could take
    my hobby from me isn't particularly compelling personally. But, even as
    such, it does seem like if people can do so, my "existential merit" is
    weakening.

    Even if as-is, apparently the people who try to do non-trivial projects
    via "vibe coding" generally end up with some big mess of poorly written
    code that falls on its face once the code gets big enough that the LLMs
    can no longer reason about it.







    Where, the idea was that the state-machine is updated with the current
    state and branch direction, giving the next state and next predicted
    branch direction (for this state).


    Could model slightly more complex patterns than the 2-bit saturating
    counters, but it is sort of a partial mystery why (for mainstream
    processors) more complex lookup schemes and 2-bit state were
    preferable to a simpler lookup scheme and 5-bit state.

    Well, apart from the relative "dark arts" needed to cram 4-bit
    patterns into a 5-bit FSM (it is a bit easier if limiting the
    patterns to 3 bits).



    Then again, had before noted that the LLMs are seemingly also not
    really able to figure out how to make a 5 bit FSM to model a full set
    of 4 bit patterns.


    Then again, I wouldn't expect it to be all that difficult of a problem
    for someone that is "actually smart"; so presumably chip designers
    could have done similar.

    Well, unless maybe the argument is that 5 or 6 bits of storage would
    cost more than 2 bits, but then presumably needing to have
    significantly larger tables (to compensate for the relative predictive
    weakness of 2-bit state) would have cost more than the cost of
    smaller tables of 6 bit state ?...

    Say, for example, 2b:
      00_0 => 10_0   //Weakly not-taken, dir=0, goes strongly not-taken
      00_1 => 01_1   //Weakly not-taken, dir=1, goes weakly taken
      01_0 => 00_0   //Weakly taken, dir=0, goes weakly not-taken
      01_1 => 11_1   //Weakly taken, dir=1, goes strongly taken
      10_0 => 10_0   //Strongly not-taken, dir=0 (stays)
      10_1 => 00_0   //Strongly not-taken, dir=1 (goes weak)
      11_0 => 01_1   //Strongly taken, dir=0 (goes weak)
      11_1 => 11_1   //Strongly taken, dir=1 (stays)
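    That table is the standard 2-bit saturating counter; written out
    as an explicit transition map (with the suffix bit read as the
    next prediction):

```python
# key = (state, observed direction) -> (next state, next prediction)
# States: 00 weakly not-taken, 01 weakly taken,
#         10 strongly not-taken, 11 strongly taken.
NEXT = {
    (0b00, 0): (0b10, 0),  # weak NT, saw 0 -> strong NT
    (0b00, 1): (0b01, 1),  # weak NT, saw 1 -> weak T
    (0b01, 0): (0b00, 0),  # weak T, saw 0 -> weak NT
    (0b01, 1): (0b11, 1),  # weak T, saw 1 -> strong T
    (0b10, 0): (0b10, 0),  # strong NT stays
    (0b10, 1): (0b00, 0),  # strong NT, saw 1 -> weak NT
    (0b11, 0): (0b01, 1),  # strong T, saw 0 -> weak T
    (0b11, 1): (0b11, 1),  # strong T stays
}
```

    Note that the strong states tolerate one wrong outcome without
    flipping the prediction, which is the hysteresis that handles loop
    exits well but alternating patterns badly.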

    Can expand it to 3-bits, for 2-bit patterns:
      As above, plus 4 more alternating states
      And slightly different transition logic.
    Say (abbreviated):
      000   weak, not-taken
      001   weak, taken
      010   strong, not-taken
      011   strong, taken
      100   weak, alternating, not-taken
      101   weak, alternating, taken
      110   strong, alternating, not-taken
      111   strong, alternating, taken
    The alternating states just flip-flop between taken and not-taken.
      The weak states can move between any of the 4.
      The strong states are used if the pattern is reinforced.
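    The transition logic is not spelled out above, so this sketch
    fills it in one plausible way: bias states predict the last
    direction, alternating states predict its opposite; a correct
    prediction strengthens, a miss from a strong state weakens, and a
    miss from a weak state switches mode:

```python
def step(state, taken):
    """One plausible 8-state update. state = (alt, strong, last);
    returns (next_state, prediction made before seeing `taken`)."""
    alt, strong, last = state
    pred = (not last) if alt else last
    if taken == pred:
        return (alt, True, taken), pred      # reinforce, record direction
    if strong:
        return (alt, False, taken), pred     # strong miss -> weak, same mode
    return (not alt, False, taken), pred     # weak miss -> switch mode
```

    On a strictly alternating stream this locks into the alternating
    states; on a biased stream it behaves like the 2-bit counter.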

    Going up to 3 bit patterns is more of the same (add another bit,
    doubling the number of states). Seemingly something goes nasty when
    getting to 4 bit patterns though (and can't fit both weak and strong
    states for longer patterns, so the 4b patterns effectively only exist
    as weak states which partly overlap with the weak states for the 3-bit
    patterns).

    But, yeah, not going to type out state tables for these ones.


    Not proven, but I suspect that an arbitrary 5 bit pattern within a 6
    bit state might be impossible. Although there would be sufficient
    state-space for the looping 5-bit patterns, there may not be
    sufficient state-space to distinguish whether to move from a
    mismatched 4-bit pattern to a 3 or 5 bit pattern. Whereas, at least
    with 4-bit, any mismatch of the 4-bit pattern can always decay to a 3-
    bit pattern, etc. One needs to be able to express decay both to
    shorter patterns and to longer patterns, and I suspect at this point,
    the pattern breaks down (but can't easily confirm; it is either this
    or the pattern extends indefinitely, I don't know...).


    Could almost have this sort of thing as a "brain teaser" puzzle or
    something...

    Then again, maybe other people would not find any particular
    difficulty in these sorts of tasks.


    Terje




    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Paul Clayton@paaronclayton@gmail.com to comp.arch on Mon Feb 23 17:17:59 2026
    From Newsgroup: comp.arch

    On 2/18/26 3:45 PM, BGB wrote:
    On 2/16/2026 3:14 PM, Paul Clayton wrote:
    On 11/5/25 2:00 AM, BGB wrote:
    On 11/4/2025 3:44 PM, Terje Mathisen wrote:
    MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    [snip]
    Branch prediction is fun.

    When I looked around online before, a lot of stuff about
    branch prediction was talking about fairly large and
    convoluted schemes for the branch predictors.

    You might be interested in looking at the 6th Championship
    Branch Prediction (2025): https://ieeetcca.org/2025/02/18/6th-
    championship-branch-prediction-cbp2025/


    Quick look, didn't see much information about who entered or won...

    This presentation (PDF) presents the winners: https://ericrotenberg.wordpress.ncsu.edu/files/2025/06/CBP2025-Closing-Remarks.pdf

    1st Place: Toru Koizumi et al.'s "RUNLTS: Register-value-aware
    predictor Utilizing Nested Large TableS"
    PDF of paper: https://ericrotenberg.wordpress.ncsu.edu/files/2025/06/cbp2025-final44-Koizumi.pdf


    2nd Place: André Seznec's "TAGE-SC for CBP2025"
    PDF of paper: https://ericrotenberg.wordpress.ncsu.edu/files/2025/06/cbp2025-final37-Seznec.pdf

    3rd Place: Yang Man et al.'s "LVCP: A Load-Value Correlated
    Predictor for TAGE-SC-L"
    PDF of paper: https://ericrotenberg.wordpress.ncsu.edu/files/2025/06/cbp2025-final15-Man.pdf

    The program lists the competitors and provides links to papers,
    presentations, video, and code: https://ericrotenberg.wordpress.ncsu.edu/cbp2025-workshop-program/

    I probably should have provided the link to the program to save
    any readers time and effort. (I had not remembered how many link
    traversals were required and just picked a url from my Firefox
    history.)

    (One nice aspect of this workshop/competition is that one can
    reproduce the results and try one's own designs. As long as one
    can trust that the traces are representative of the workloads
    one cares about, this seems significant.)

    TAgged GEometric length predictors (TAGE) seem to be the current
    "hotness" for branch predictors. These record very long global
    histories and fold them into shorter indexes with the number of
    history bits used varying for different tables.

    (Because the correlation is less strong, 3-bit counters are
    generally used, along with a "useful" bit.)


    When I messed with it, increasing the strength of the saturating
    counters was not effective.

    But, increasing the ability of them to predict more complex
    patterns did help.

    Branch prediction is heuristic and the tradeoffs change based on
    the workload and the storage (and latency) budget.

    Saturating counters record localized bias not a pattern, so
    fuzzy behavior might be more accurately predicted on average
    while (as you noted) repeated patterns are less likely to be
    predicted accurately.

    Larger counters can even out some irregularity in the bias but
    increase training time. (With tags, a new entry can be
    initialized to a decent state, which can reduce training time.)

    For per-address prediction, there are ways of reducing the
    overhead for tagging each entry. If a branch target buffer (BTB)
    is used, one is already storing a target address (or at least an
    offset/cache index) so adding some tag bits is not quite as
    expensive and allows associativity; placing a per-address
    prediction in a BTB was a common design. Per-address predictions
    can also be associated with the Icache, in which case they can
    reuse the Icache tagging. Tagging reduces aliases, but obviously
    bits spent on tags cannot be spent on prediction entries
    themselves (fewer entries means evictions that reduce training
    time).

    Aliasing is a major issue even with per-address prediction; with
    global branch history aliasing is even more likely. Several
    techniques have been proposed (besides tagging) to reduce
    aliasing such as multiple diversely indexed predictors with a
    majority vote. (The never manufactured Alpha 21464 design used
    three differently indexed tables to produce a majority vote and
    a fourth table to choose whether to use the majority vote or the
    prediction from a specific table. See "Design Tradeoffs for the
    Alpha EV8 Conditional Branch Predictor" for details.)

    (Mitch Alsup would, of course, mention his agree predictor which
    uses a second source such as a per-address prediction or a
    static prediction to exploit that predictions are likely to
    agree with their local prediction and two branches that use the
    same index could have different direction biases. In addition,
    if the global predictor is used as a complement to a per-address
    predictor, different branches with the same global index may be
    likely to disagree with a per-address predictor.)

    Capacity issues can be so significant that for some workloads
    single bit predictors outperform two-bit counters. Predicting
    more static (per-address) branches somewhat accurately can be
    more important than predicting fewer branches more accurately.

    Counters also have the minor advantage of providing a confidence
    estimate that can be used for things like prediction selection,
    checkpoint selection (doing a renaming snapshot to speed
    misprediction recovery), dynamic predication, prefetch
    throttling, or execution throttling (to save power/energy).

    But, then always at the end of it using 2-bit saturating
    counters:
      weakly taken, weakly not-taken, strongly taken, strongly
      not taken.

    But, in my fiddling, there was seemingly a simple but
    moderately effective strategy:
      Keep a local history of taken/not-taken;
      XOR this with the low-order-bits of PC for the table index;
      Use a 5/6-bit finite-state-machine or similar.
        Can model repeating patterns up to ~ 4 bits.

    Indexing a predictor by _local_ (i.e., per instruction address)
    history adds a level of indirection; once one has the branch
    (fetch) address one needs to index the local history and then
    use that to index the predictor.
    [snip]

    OK, seems I wrote it wrong:
    I was using a global branch history.

    The terminology can be tricky. The worst is probably the
    difference between bimode and bimodal; the former is similar
    to agree prediction in attempting to separate bias, the latter
    is a name for per-address prediction (two modes, taken or not
    taken).

    But, either way, the history is XOR'ed with the relevant bits
    from PC to generate the index.

    This is called a gshare predictor as distinguished from a
    gselect predictor, which concatenates bits of the global
    history and the branch address rather than XORing them.
    (There is also a distinction of path history -- address bits
    rather than taken/not-taken -- and global branch history.) As you
    probably discovered in your experimentation, hashing in the
    address bits provides significantly better prediction.
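    In index-formation terms (a minimal sketch; the bit widths are
    arbitrary, and the function names are made up):

```python
def gshare_index(pc, ghist, bits=10):
    """gshare: XOR global history with branch-address bits."""
    mask = (1 << bits) - 1
    return ((pc >> 2) ^ ghist) & mask

def gselect_index(pc, ghist, pc_bits=5, hist_bits=5):
    """gselect: concatenate (select) address bits and history bits."""
    pc_part = (pc >> 2) & ((1 << pc_bits) - 1)
    return (pc_part << hist_bits) | (ghist & ((1 << hist_bits) - 1))
```

    For the same total index width, gshare lets every history bit and
    every address bit influence the whole index, which is why it
    usually aliases less than gselect's fixed split.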

    I do not know if 5/6-bit state machines have been academically
    examined for predictor entries. I suspect the extra storage is a
    significant discouragement given one often wants to cover more
    different correlations and branches.


    If the 5/6-bit FSM can fit more patterns than 3x 2-bit
    saturating counters, it can be a win.

    I suspect it very much depends on whether bias or pattern is
    dominant. This would depend on the workload (Doom?) and the
    table size (and history length). I do not know that anyone in
    academia has explored this, so I think you should be proud of
    your discovery even if it has limited application.

    A larger table (longer history) can mean longer training, but
    such also discovers more patterns and longer patterns (e.g.,
    predicting a fixed loop count). However, correlation strength
    tends to decrease with increasing distance (having multiple
    history lengths and hashings helps to find the right history).

    As noted, the 5/6 bit FSM can predict arbitrary 4 bit patterns.

    When the pattern is exactly repeated this is great, but if the
    correlation with global history is fuzzy (but biased) a counter
    might be better.

    I get the impression that branch prediction is complicated
    enough that even experts only have a gist of what is actually
    happening, i.e., there is a lot of craft and experimentation and
    less logical derivation (I sense).

    With the PC XOR Hist lookup, there were still quite a few
    patterns, and not just contiguous 0s or 1s that the saturating
    counter would predict, nor just all 0s, all 1s, or ...101010...
    that the 3-bit FSM could deal with.

    But, a bigger history could mean fewer patterns and more
    contiguous bits in the state.

    TAGE has the advantage that the tags reduce branch aliases and
    the variable history length (with history folding/compression)
    allows using less storage (when a prediction only benefits from
    a shorter history) and reduces training time.

    In my case, was using 6-bit lookup mostly to fit into LUT6 based
    LUTRAM.

    Going bigger than 6 bits here is a pain point for FPGAs, more so
    as BRAM's don't support narrow lookups, so the next size up
    would likely be 2048x but then inefficiently using the BRAM
    (there isn't likely a particularly good way to make use of 512x
    18-bits).

    Presumably selecting specific bits of the 18 (shifting/alignment
    network) is expensive in an FPGA? With more bits per index, it
    might also make sense to use associativity with partial tags. It
    might be practical with two-way associativity to use different
    numbers of tag bits (with different address hashings) in
    different ways and different state sizes.

    [snip]
    Local history patterns may also be less common than statistical
    correlation after one has extracted branches predicted well by
    global history. (For small-bodied loops, a moderately long
    global history provides substantial local history.)

    It seems what I wrote originally was inaccurate, I don't store a
    history per-target, merely it was recent taken/not-taken branches.

    Well, I mixed up way and set in one of my comp.arch postings,
    and this was after (if I recall correctly) having read Computer
    Architecture: A Quantitative Approach as well as more than a
    couple of papers and web articles and comp.arch postings.

    But, I no longer remember what I was thinking at the time, or
    why I had written local history rather than global history
    (unless I meant "local" in terms of recency or something, I
    don't know).

    "Local" is not really a good term for per-address because of the
    potential confusion with temporal locality and the generality (a
    local branch prediction is not about predicting branches with
    nearby addresses -- I am skeptical that there is much "spatial
    locality" for branch direction).

    (Terminology in computer architecture is not very well
    organized, consistent, or insightful. I feel "Instruction
    Pointer" is a better term than "Program Counter" and
    "Translation Cache" is a better term than "Translation Lookaside
    Buffer"; the former is still strong because of x86, but the
    latter had Itanium as a "backer". In terms of historical
    precedence, "elbow cache" should be preferred for a cache that
    uses different indexing methods to find an alternative index for
    an evicted block rather than "cuckoo cache", which seems more
    evocative and was used for hash tables. Then there is "block"
    (IBM) versus "line" (Intel) and whether a "sector" is larger or
    smaller than a line/block and what pipeline stage is "issue" and
    which "dispatch".)

    [snip]

    A lot of this seems a lot more complex though than what would be
    all that practical on a Spartan or Artix class FPGA.

    Modern high performance branch predictors are huge! The branch
    prediction section for AMD's Zen 4 is larger than the
    instruction cache and close to the size of (possibly larger
    than) the instruction cache and decode sections combined: https://substackcdn.com/image/fetch/$s_!Ak84!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d037035-aad5-4ff5-931b-c51f0409e9f3_787x378.jpeg
    (from the Chips & Cheese article "Zen 5's 2-Ahead Branch
    Predictor Unit: How a 30 Year Old Idea Allows for New Tricks" https://chipsandcheese.com/p/zen-5s-2-ahead-branch-predictor-unit-how-30-year-old-idea-allows-for-new-tricks
    )

    I was mostly using 5/6 bit state machines as they gave better
    results than 2-bit saturating counters, and fit nicely within
    the constraints of a "history XOR PC" lookup pattern.

    I think it is very neat that you were experimenting with such.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Feb 25 17:40:11 2026
    From Newsgroup: comp.arch

    On 2/23/2026 4:17 PM, Paul Clayton wrote:
    On 2/18/26 3:45 PM, BGB wrote:

    snip, not feeling up to responding to everything at the moment.

    So, for now, reevaluating things more narrowly.


    I do not know if 5/6-bit state machines have been academically
    examined for predictor entries. I suspect the extra storage is a
    significant discouragement given one often wants to cover more
    different correlations and branches.


    If the 5/6-bit FSM can fit more patterns than 3x 2-bit saturating
    counters, it can be a win.

    I suspect it very much depends on whether bias or pattern is
    dominant. This would depend on the workload (Doom?) and the
    table size (and history length). I do not know that anyone in
    academia has explored this, so I think you should be proud of
    your discovery even if it has limited application.

    A larger table (longer history) can mean longer training, but
    such also discovers more patterns and longer patterns (e.g.,
    predicting a fixed loop count). However, correlation strength
    tends to decrease with increasing distance (having multiple
    history lengths and hashings helps to find the right history).


    Yeah.

    I just decided to go off and do some slightly more extensive testing.


    As noted, the 5/6 bit FSM can predict arbitrary 4 bit patterns.

    When the pattern is exactly repeated this is great, but if the
    correlation with global history is fuzzy (but biased) a counter
    might be better.
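    As a toy illustration of why 4 bits of history suffice for a strictly
    repeating 4-bit pattern (this is a hypothetical direct
    history-to-next-bit learner of my own, with made-up `toy_*` names, not
    the 5/6-bit FSM itself):

```c
/* For a strictly repeating pattern, each 4-bit history window uniquely
   determines the next bit, so a 16-entry last-seen table learns the
   pattern after one full period. */
static int next_bit[16];

int  toy_predict(unsigned hist4)            { return next_bit[hist4 & 15]; }
void toy_train(unsigned hist4, int outcome) { next_bit[hist4 & 15] = outcome; }
```

    The FSM presumably has to encode this kind of history-to-outcome
    mapping (plus lead-in and recovery behavior) in its state graph, while
    fuzzy, merely biased behavior favors a counter's hysteresis.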

    I get the impression that branch prediction is complicated
    enough that even experts only have a gist of what is actually
    happening, i.e., there is a lot of craft and experimentation and
    less logical derivation (I sense).


    I decided to try comparing a few options by sampling branch data and
    then testing against it offline.

    Though, the current empirically-sampled data is limited in scope
    (currently covering running the Doom game loop, and with enough
    sample-points to cover roughly 2 seconds of running time).

    Even for a 50 MHz CPU, sampling more than a few seconds worth of
    branches would eat a significant chunk of RAM and make for a fairly
    large file.


    Recording the startup process, launching Doom, and running part of the
    game loop, while preferable, will take up too much space (if one is
    recording both the PC address and branch direction).

    In some other past contexts, it looked like the 5/6 bit FSM was the
    clear winner. But, with Doom samples and offline testing, not so clear.


    Here, context is the low-order bits of PC XORed with the recent branch history.
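    A minimal sketch of that indexing with the 2-bit saturating counter
    baseline; the table size, function names, and the absence of any PC
    shifting/alignment handling are illustrative assumptions on my part,
    not BGB's actual implementation:

```c
#include <stdint.h>

#define CTX_BITS 6                 /* matches the 6-bit context below */
#define TABLE_SZ (1u << CTX_BITS)

static uint8_t  ctr2[TABLE_SZ];    /* 2-bit saturating counters, 0..3 */
static uint32_t ghist;             /* recent outcomes, LSB = newest */

/* Context: low-order PC bits XORed with the recent branch history. */
static unsigned ctx(uint32_t pc)
{
    return (pc ^ ghist) & (TABLE_SZ - 1u);
}

/* Predict taken when the counter is in its upper half (2 or 3). */
int predict_2bit(uint32_t pc)
{
    return ctr2[ctx(pc)] >= 2;
}

/* Update the counter for the resolved branch, then shift in the outcome. */
void update_2bit(uint32_t pc, int taken)
{
    uint8_t *c = &ctr2[ctx(pc)];
    if (taken) { if (*c < 3) (*c)++; }
    else       { if (*c > 0) (*c)--; }
    ghist = (ghist << 1) | (unsigned)(taken & 1);
}
```

    The 1-bit and 3-bit variants below differ only in the per-entry state;
    the 5/6-bit FSM replaces the counter update with a table lookup.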


    Miss rate from the collected data (with a 6-bit context):
    1-bit, last bit : ~ 19.14%
    2-bit, sat counter : ~ 14.87% (2nd place)
    3-bit, sat counter : ~ 18.24%
    5/6-bit, GA evolved : ~ 14.25% (winner)

    3/4-bit, alternating (hand): ~ 35.95% (bad)
    4/5-bit, 2/3 bit pat (hand): ~ 38.75% (bad)
    5/6 (hand) : Untested (too much effort)
    (Will leave out the hand-written FSMs as they seem to suck).
    ...

    If context is reduced to 5 bits:
    1-bit, last bit : ~ 30.55%
    2-bit, sat counter : ~ 17.37% (2nd place)
    3-bit, sat counter : ~ 21.69%
    5/6-bit FSM : ~ 15.86% (winner)

    If context is reduced to 4 bits:
    1-bit, last bit : ~ 30.55%
    2-bit, sat counter : ~ 22.15% (2nd place)
    3-bit, sat counter : ~ 29.04%
    5/6-bit FSM : ~ 19.25% (winner)


    If context is increased to 8 bits:
    1-bit, last bit : ~ 13.07%
    2-bit, sat counter : ~ 12.32% (winner)
    3-bit, sat counter : ~ 13.96%
    5/6-bit FSM : ~ 12.79% (2nd place)

    If context is increased to 12 bits:
    1-bit, last bit : ~ 9.46% (winner)
    2-bit, sat counter : ~ 9.85% (2nd place)
    3-bit, sat counter : ~ 10.38%
    5/6-bit FSM : ~ 10.70%


    So, it seems increasing the context size causes the FSM to lose its
    advantage (mostly as more of the contexts seem to be dominated by
    single-bit patterns, and accuracy depends more on how quickly the
    predictor can adapt).

    The 3-bit saturating counter here is nearly always worse than the
    2-bit saturating counter.


    If optimizing for "most effectiveness per bit of storage":
    8-bit context + 2-bit sat counter would win;
    If optimizing for FPGA resource cost: 5 and 6 bit contexts would win.
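    For concreteness, the raw storage arithmetic behind that trade-off,
    assuming one direct-mapped entry per context value: the 8-bit-context
    counter table is (1<<8)*2 = 512 bits, while a 6-bit-context FSM table
    is (1<<6)*6 = 384 bits.

```c
/* Storage for a direct-mapped predictor table: one entry per context
   value, so total bits = (1 << context_bits) * bits_per_entry. */
int table_bits(int ctx_bits, int entry_bits)
{
    return (1 << ctx_bits) * entry_bits;
}
```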



    Did go and modify the GA table evolver to also use the sampled data as
    a reference for the initial evolver step (along with the original,
    purely synthetic patterns). This appears to have maybe helped.

    Slightly faster adaptation and noise tolerance; but margins seem small
    (can't rule out the possibility of the difference being due to
    RNG/"butterfly effect" or similar, rather than due to training on "real
    world" data vs purely synthetic patterns).


    With 6-bit context, sampled data vs pure synthetic:
    5/6-bit, GA evolved (real) : ~ 14.25% (winner)
    5/6-bit, GA evolved (synth): ~ 15.33%

    Where, the inclusion of real sampled data in the training process does
    appear to improve accuracy over a version trained purely on synthetic patterns.



    Using a genetic algorithm to generate the tables seems to vastly
    outperform my ability to naively hand-fill them.

    I suspect this is because the GA is able to generate tables which adapt
    to changing patterns more quickly.


    The hand-filled tables tend to need to get back to "hub states" which
    then migrate to the target state for the pattern. The GA-generated
    tables seem able to side-step the need for hub states and adapt to a
    change in pattern within around a few bits (if incrementing the
    bit-pattern, it seems to have a 2-bit lead-in).


    But the need for hub states does put the hand-filled tables at a
    disadvantage relative to the saturating counters; as noted, this does
    not apply to whatever the genetic algorithm is doing.

    So, it seems the hand-written tables are at a serious disadvantage
    here.


    I did verify that it was still able to adapt to arbitrary repeating 3
    and 4 bit patterns (and accurately express these patterns), with the
    main variability being that the GA-evolved tables seem to adapt more
    quickly and are also better able to tolerate "noise".


    But, alas, this doesn't exactly mean that a 5/6 bit FSM is a clear
    winner over a 2-bit saturating counter in the general case.

    Comparing against branch values from initial boot process (6b ctx):
    1-bit, last bit : ~ 11.39%
    2-bit, sat counter : ~ 6.48%
    3-bit, sat counter : ~ 33.86%
    5/6-bit, GA evolved (real) : ~ 4.90%
    5/6-bit, GA evolved (synth): ~ 5.02%


    Granted, this was the context that originally led me to choose the 5/6
    bit FSM as the clear winner over the 2-bit saturating counter.


    And, for reasons, I mostly ended up not using hand-written FSMs in any
    of these cases, as I don't understand whatever black arts the GA is
    using here, nor can I replicate the same level of performance (also,
    even if I have since figured out how to do so, writing out such an FSM
    by hand is still a PITA).

    So, alas, the use of a genetic algorithm seems superior at this task...


    In this case, a C version of the 5/6 bit FSM table looks like:
    static byte fsmtab_5b[64]={
    0x32, 0x20, 0x3E, 0x3B, 0x18, 0x02, 0x31, 0x13,
    0x0C, 0x10, 0x06, 0x16, 0x1E, 0x24, 0x06, 0x12,
    0x0F, 0x0B, 0x01, 0x2B, 0x34, 0x19, 0x1D, 0x2C,
    0x33, 0x37, 0x26, 0x1B, 0x0D, 0x11, 0x08, 0x03,
    0x0D, 0x39, 0x3D, 0x2A, 0x01, 0x13, 0x26, 0x37,
    0x34, 0x36, 0x3F, 0x2D, 0x15, 0x21, 0x0E, 0x2B,
    0x29, 0x2E, 0x38, 0x04, 0x0C, 0x25, 0x00, 0x17,
    0x30, 0x2F, 0x09, 0x03, 0x14, 0x05, 0x3D, 0x23,
    };

    With the input/output data bit in the LSB (other 5 bits being state).
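    Under my reading of that convention (an assumption on my part, not
    something BGB spelled out): the 6-bit table index is
    (state << 1) | actual_outcome, and each entry packs the next 5-bit
    state above a prediction bit in the LSB. A single-context stepper
    might then look like (names mine; "byte" replaced with unsigned char
    so it compiles stand-alone):

```c
/* Same 64-entry table as above: 5 state bits + 1 outcome bit -> 6-bit index. */
static const unsigned char fsmtab_5b[64] = {
    0x32, 0x20, 0x3E, 0x3B, 0x18, 0x02, 0x31, 0x13,
    0x0C, 0x10, 0x06, 0x16, 0x1E, 0x24, 0x06, 0x12,
    0x0F, 0x0B, 0x01, 0x2B, 0x34, 0x19, 0x1D, 0x2C,
    0x33, 0x37, 0x26, 0x1B, 0x0D, 0x11, 0x08, 0x03,
    0x0D, 0x39, 0x3D, 0x2A, 0x01, 0x13, 0x26, 0x37,
    0x34, 0x36, 0x3F, 0x2D, 0x15, 0x21, 0x0E, 0x2B,
    0x29, 0x2E, 0x38, 0x04, 0x0C, 0x25, 0x00, 0x17,
    0x30, 0x2F, 0x09, 0x03, 0x14, 0x05, 0x3D, 0x23,
};

static unsigned fsm_state = 0;   /* current 5-bit state for one context */

/* Feed in the resolved outcome; the looked-up entry yields the predicted
   direction of the next branch (LSB) and the next state (upper 5 bits). */
int fsm_step(int taken)
{
    unsigned v = fsmtab_5b[((fsm_state & 0x1Fu) << 1) | (unsigned)(taken & 1)];
    fsm_state = v >> 1;
    return (int)(v & 1);
}
```

    In hardware this is presumably just a 64-entry, 6-bit-wide LUT per
    table slot, which is where the FPGA-friendliness comes from.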


    ...


    --- Synchronet 3.21b-Linux NewsLink 1.2