• A useless machine

    From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Feb 14 09:57:13 2026
    From Newsgroup: comp.arch

    Some people are throwing large amounts of computing power at trying
    to disprove the Collatz conjecture. Graphics cards have sped this
    up enormously.

    (A reminder, if any is needed: From a starting number n, the
    transformation, recursively applied,

    f(n) = 3*n+1 if n is odd; f(n) = n/2 if n is even

    is conjectured to always reach 1 for any positive integer n).
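
    For concreteness, the iteration can be sketched in a few lines of
    Python (mine, purely illustrative):

```python
def collatz_steps(n):
    """Count iterations of f until n reaches 1, and track the peak value."""
    steps = 0
    peak = n
    while n != 1:
        n = 3 * n + 1 if n % 2 else n // 2
        peak = max(peak, n)
        steps += 1
    return steps, peak

# The well-known record-setter below 100:
print(collatz_steps(27))   # 27 needs 111 steps and peaks at 9232
```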

    All the work is done on general-purpose hardware, which has many
    capabilities that are not needed and which consume area and power
    that special-purpose hardware would not. Also, the hardware is
    limited to 64-bit integers, while the range of tested numbers is
    now up to 2^71, so multi-word arithmetic is required.

    What could specialized hardware look like?

    The central part could be a unit which is handed a block of numbers
    to check, for example a range of 2^32 numbers. The current records
    for number of iterations and maximum number reached would also be
    stored in that unit.

    In that unit, the calculation would be performed by a combination
    of adder and shifter. The input would always be odd (so the last
    bit could be dropped), and it would calculate (3*n+1)>>ctz(3*n+1).
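
    In software terms, one such combined odd-to-odd step is (a Python
    sketch; ctz = count trailing zeros):

```python
def next_odd(n):
    """One hardware iteration: odd n -> (3*n+1) >> ctz(3*n+1), again odd."""
    assert n % 2 == 1
    m = 3 * n + 1          # always even, so at least one shift follows
    while m % 2 == 0:      # equivalent to m >> ctz(m)
        m //= 2
    return m

print(next_odd(7))   # 3*7+1 = 22 -> 11
print(next_odd(27))  # 3*27+1 = 82 -> 41
```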

    The adder would compute (3*n+1)/2, and would be a specialized
    128-bit adder. Why specialized? The logic is far less complex than a
    general-purpose adder. For example, a two-bit adder has the formula

    y1 = (!a1&a0) | (!c_in&a1&!a0);
    y0 = (c_in&!a1&!a0) | (!c_in&a1&!a0) | (a1&a0);

    and its propagate and generate bits are

    p = (a1);
    g = (a1&a0);

    The maximum number of trailing zeros after the (3*n+1)/2 step is
    three and can be determined cheaply from bits <3:1> of the input,
    so the input to the shifter (two levels only) can be provided
    in parallel. The machine would be designed so that adding and
    shifting can be done in a single cycle.
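
    That the shift amount can be looked up in parallel is easy to
    check: capped at three, it depends only on n mod 16, i.e. on bits
    <3:1> of the odd input. A quick Python check (my sketch):

```python
# min(ctz((3*n+1)//2), 3) is a function of n mod 16 alone, which is
# what lets the shift amount be looked up in parallel with the add.
def shift_amount(n):                           # n odd
    m = (3 * n + 1) // 2
    return min((m & -m).bit_length() - 1, 3)   # ctz(m), capped at 3

table = {r: shift_amount(r) for r in range(1, 16, 2)}
for n in range(1, 100001, 2):
    assert shift_amount(n) == table[n % 16]
```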

    The number of iterations would be tracked; if larger than a previous
    record, the calculation would terminate with a corresponding signal
    to the outside world.

    Also, each cycle, the result from the previous round would be
    compared against the input value. If it is lower, the next number
    would be chosen. Finally, the number reached would be checked against
    the previous, pre-set maximum, which would also signal to the outside
    world.

    Which numbers to choose? It is well known that only certain
    residues modulo 2^n are interesting; the counts are given in
    OEIS A076227. For example, modulo 2^10, only 64 residues are
    "interesting", or 6.25% (and only eight bits would be needed to
    store each, because the last two bits are always one).
    Modulo 2^16 it would be 2114 residues, or 3.22%, which would cut
    the amount of numbers to be checked roughly in half again. One
    would use a small ROM, I presume. The number of bits would be a
    tradeoff between area and amount of computation. My current guess
    is that 2^10 could be a sweet spot. A further third of the numbers
    could be excluded by checking for divisibility by three.
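
    The residue sieve is easy to reproduce (my sketch of the standard
    k-step sieve; by this count, 64 residues survive modulo 2^10 and
    2114, about 3.22%, survive modulo 2^16):

```python
# A residue r mod 2^k "survives" if its first k steps of
# T(x) = x/2 (x even), (3x+1)/2 (x odd) never provably drop below the
# starting value, i.e. 3^(odd steps so far) never falls below
# 2^(steps so far).
def survivors(k):
    count = 0
    for r in range(1 << k):
        x, odd = r, 0
        for j in range(1, k + 1):
            if x % 2:
                x, odd = (3 * x + 1) // 2, odd + 1
            else:
                x //= 2
            if 3 ** odd < 2 ** j:     # provably below start: eliminated
                break
        else:
            count += 1
    return count

print(survivors(10))   # 64 residues survive mod 2^10 (6.25%)
```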

    The problem is embarrassingly parallel. One could choose a size so
    that each parcel is calculated in a few seconds or even minutes,
    to avoid a large amount of communication with the outside.

    Because all units would be working all the time, and if one has
    a smaller unit one could simply build more of them, I suspect that
    power would be the main limiting factor. An adder design that is
    not very fast, but runs at low power, low voltage and low
    frequency, would be beneficial.

    Comments?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Feb 14 18:48:39 2026
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    Some people are throwing large amounts of computing power at trying
    to disprove the Collatz conjecture. Graphics cards have sped this
    up enormously.

    (A reminder, if any is needed: From a starting number n, the
    transformation, recursively applied,


    f(n) = 3*n+1, if n is odd; =n/2 if n is even.

    is conjectured to always reach 1 for any positive integer n).

    While n/2 remains even, n/2 is guaranteed to converge to 1.
    So, only the odd cases are interesting.

    3*n+1 is always even (since n was odd, and odd*odd is odd and
    odd+1 is even). So, a double iteration is (3*n+1)/2.

    All the work is done on general-purpose hardware, which has many
    capabilities that are not needed, and consume area and power that special-purpose hardware would not need. Also, the hardware
    is limited to 64-bit integers, and the range of tested numbers
    is now up to 2^71, so

    Induction would say it has been proven by then.

    What could specialized hardware look like?

    The central part could be a unit which is handed a block of numbers
    to check, for example a range of 2^32 numbers. The current records
    for number of iterations and maximum number reached would also be
    stored in that unit.

    In that unit, the calculation would be performed by a combination
    of adder and shifter. The input would always be odd (so the last
    bit could be dropped), and it would calculate (3*n+1)>>ctz(3*n+1).

    The adder would compute (3*n+1)/2, and would be a specialized 128-bit
    adder. Why specialized? The logic is far less complex than a general-purpose adder. For example, a two-bit adder has the formula

    y1 = (!a1&a0) | (!c_in&a1&!a0);
    y0 = (c_in&!a1&!a0) | (!c_in&a1&!a0) | (a1&a0);

    and its propagate and generate bits are

    p = (a1);
    g = (a1&a0);

    The maximum number of trailing zeros after the (3*n+1)/2 step is
    three and can be determined cheaply from bits <3:1> of the input,
    so the input to the shifter (two levels only) can be provided
    in parallel. The machine would be designed that adding and
    shifting can be done in a single cycle.

    One can also prove that certain kinds of carry chains would never
    be a speed path in this calculation and guarantee that both shifter
    and adder are 1 cycle.

    Comments?

    Interesting problem,
    waste of power.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Sat Feb 14 20:18:26 2026
    From Newsgroup: comp.arch

    On 14/02/2026 19:48, MitchAlsup wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Some people are throwing large amounts of computing power at trying
    to disprove the Collatz conjecture. Graphics cards have sped this
    up enormously.

    (A reminder, if any is needed: From a starting number n, the
    transformation, recursively applied,


    f(n) = 3*n+1, if n is odd; =n/2 if n is even.

    is conjectured to always reach 1 for any positive integer n).

    While n/2 remains even, n/2 is guaranteed to converge to 1.
    So, only the odd cases are interesting.

    3*n+1 is always even (since n was odd, and odd*odd is odd and
    odd+1 is even). So, a double iteration is (3*n+1)/2.

    Yes, that is often a micro-optimisation used.


    All the work is done on general-purpose hardware, which has many
    capabilities that are not needed, and consume area and power that
    special-purpose hardware would not need. Also, the hardware
    is limited to 64-bit integers, and the range of tested numbers
    is now up to 2^71, so

    Induction would say it has been proven by then.

    Would you like to re-think that? "Induction" says nothing of the sort.

    The conjecture has been proven true for all numbers up to 2 ^ 71
    (assuming that is the current figure) - it has most definitely /not/
    been proven to be true for all numbers. There are other things proven
    about it, such as the asymptotic rarity of numbers that are exceptions
    to the pattern, but no one has any notion of how to prove it in general.

    Being true up to 2 ^ 71 makes it reasonable to believe it is always
    true, but that is very different from being /proven/. There are plenty
    of mathematical conjectures that are true up to very large numbers, but
    fail later on.


    Comments?

    Interesting problem,
    waste of power.

    It is a mathematically very interesting problem. Whether there is
    anything to be gained by continuing the search for exceptions by brute
    force, is a different matter - but /if/ it is worth doing, then doing so
    more efficiently is a good idea.


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Feb 14 21:22:15 2026
    From Newsgroup: comp.arch


    David Brown <david.brown@hesbynett.no> posted:

    On 14/02/2026 19:48, MitchAlsup wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:


    All the work is done on general-purpose hardware, which has many
    capabilities that are not needed, and consume area and power that
    special-purpose hardware would not need. Also, the hardware
    is limited to 64-bit integers, and the range of tested numbers
    is now up to 2^71, so

    Induction would say it has been proven by then.

    Would you like to re-think that? "Induction" says nothing of the sort.

    The conjecture has been proven true for all numbers up to 2 ^ 71
    (assuming that is the current figure) - it has most definitely /not/
    been proven to be true for all numbers. There are other things proven
    about it, such as the asymptotic rarity of numbers that are exceptions
    to the pattern, but no one has any notion of how to prove it in general.

    By proving it up to 2^71, you simultaneously prove that it cannot be
    proven by computer-like arithmetic processes: if the bound were
    2^71,000,000, you still would not have proven it. Thereby, it is not
    provable by simple computer arithmetic processes.

    Ergo, you need a different means.

    Being true up to 2 ^ 71 makes it reasonable to believe it is always
    true, but that is very different from being /proven/. There are plenty
    of mathematical conjectures that are true up to very large numbers, but
    fail later on.

    You cannot even /prove/ it using BigNum computer arithmetic in finite time.


    Comments?

    Interesting problem,
    waste of power.

    It is a mathematically very interesting problem. Whether there is
    anything to be gained by continuing the search for exceptions by brute force, is a different matter - but /if/ it is worth doing, then doing so more efficiently is a good idea.


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Sat Feb 14 13:30:36 2026
    From Newsgroup: comp.arch

    On 2/14/2026 1:22 PM, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 14/02/2026 19:48, MitchAlsup wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:


    All the work is done on general-purpose hardware, which has many
    capabilities that are not needed, and consume area and power that
    special-purpose hardware would not need. Also, the hardware
    is limited to 64-bit integers, and the range of tested numbers
    is now up to 2^71, so

    Induction would say it has been proven by then.

    Would you like to re-think that? "Induction" says nothing of the sort.

    The conjecture has been proven true for all numbers up to 2 ^ 71
    (assuming that is the current figure) - it has most definitely /not/
    been proven to be true for all numbers. There are other things proven
    about it, such as the asymptotic rarity of numbers that are exceptions
    to the pattern, but no one has any notion of how to prove it in general.

    By proving up to 2^71, you simultaneously prove that it cannot be proven
    by computer-like arithmetic processes. What if the number was 2^71,000,000 you still have not proven it? Thereby, it is not provable by simple computer arithmetic processes.

    While it is certainly true that you can't prove it that way, no
    matter how large a number you get up to, you may be able to
    disprove it, which would be important in and of itself.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Tim Rentsch@tr.17687@z991.linuxsc.com to comp.arch on Sat Feb 14 21:19:53 2026
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:

    Comments?

    Amusing. I predict that ultimately it will be a waste of
    effort because the Collatz conjecture is true even though
    currently unproven. Also it seems likely that no mathematical
    insights will result from any sort of brute force approach to
    disprove it, even if the "brute force" scheme is selective.

    If anyone is interested in looking into questions regarding the
    Collatz conjecture, I suggest considering a generalization

    f[k](n) = n/2 if n is even, 3n+k if n is odd, for any fixed odd k

    Empirically you should quickly find

    There is always a cycle starting at k. (Duh!)

    There is only one cycle if k is a power of 3. (This result
    follows from the Collatz conjecture.)

    If k is not a power of 3, there are always at least two cycles,
    and can be more.
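
    These observations are easy to check empirically (a Python sketch;
    `find_cycles` and its start range and overflow cutoff are mine):

```python
# Iterate T(x) = x/2 (x even), 3x+k (x odd) from small starting values
# and collect the distinct cycles reached, identified by their minimum
# element. Trajectories exceeding `limit` are simply skipped.
def find_cycles(k, starts=range(1, 200), limit=10**6):
    cycles = set()
    for n in starts:
        seen = set()
        x = n
        while x not in seen and x < limit:
            seen.add(x)
            x = 3 * x + k if x % 2 else x // 2
        if x in seen:                       # walked into a cycle
            cyc = [x]
            y = 3 * x + k if x % 2 else x // 2
            while y != x:
                cyc.append(y)
                y = 3 * y + k if y % 2 else y // 2
            cycles.add(min(cyc))
    return sorted(cycles)

print(find_cycles(1))   # [1] -- the usual 1 -> 4 -> 2 -> 1 cycle
print(find_cycles(5))   # several cycles, e.g. with minima 1, 5, 19, 23
```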
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Feb 15 03:58:28 2026
    From Newsgroup: comp.arch

    On 2026-02-14 4:57 a.m., Thomas Koenig wrote:
    Comments?

    I wonder how difficult it would be to implement a parallel tester in an
    FPGA. It looks simple enough and there are hundreds of DSP blocks
    available to use. Could test in blocks of hundreds of numbers at a time. Running at 100 MHz * 200 tests at a time = how long to get to 2^71?

    I wonder if the calculation can be broken down further. Numbers are all
    just defined by rules. I wonder if some calculus may apply. Limit of
    function approaches one for a finite number of steps. There may be a
    whole range of algorithms, is there one for the number two? Then the
    number three?

    Occasionally I have made useless machinery. Great puzzle work. Made the
    game of life in hardware.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Sun Feb 15 13:14:21 2026
    From Newsgroup: comp.arch

    On 15/02/2026 09:58, Robert Finch wrote:

    I wonder how difficult it would be to implement a parallel tester in an FPGA. It looks simple enough and there are hundreds of DSP blocks
    available to use. Could test in blocks of hundreds of numbers at a time. Running at 100 MHz * 200 tests at a time = how long to get to 2^71?


    One of the nice things about this is that you don't need DSP blocks -
    it all works well with very simple operations well-suited to an FPGA.
    Your calculation blocks would hold an n-bit number (wider than the
    numbers you are testing, since rounds of the Collatz function can
    grow a lot) and a counter. Initialise the block to the number you
    want to test. For each round, check the LSB. If it is 0, right-shift
    your register. If it is 1, replace "x" with "(x << 1) + x + 1".
    That's just some simple logic - no DSP blocks involved. If your
    register or your counter overflows, you've found an "interesting"
    number and can pass that out to a PC to test it fully.
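
    The per-round update can be modelled behaviourally in a few lines
    (a Python sketch of such a worker block, using 3x+1 for the odd
    case; width and iteration cap are arbitrary choices of mine):

```python
# One worker block: an n-bit register plus an iteration counter; report
# the number as "interesting" (None) if either would overflow.
def run_block(start, width=128, max_iter=1 << 16):
    x, count = start, 0
    while x != 1:
        if x & 1:
            x = (x << 1) + x + 1      # odd: 3x + 1, as shift-and-add
        else:
            x >>= 1                   # even: right-shift
        count += 1
        if x >> width or count >= max_iter:
            return None               # overflow: hand off to the PC
    return count

print(run_block(27))   # 111 iterations, no overflow
```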

    There's lots of scope for both high-level work (the housekeeping and
    tracking of operations for each of your worker elements) and low-level
    work (getting the carry operations to work at high speed).

    I wonder if the calculation can be broken down further. Numbers are all
    just defined by rules. I wonder if some calculus may apply. Limit of function approaches one for a finite number of steps. There may be a
    whole range of algorithms, is there one for the number two? Then the
    number three?

    Occasionally I have made useless machinery. Great puzzle work. Made the
    game of life in hardware.


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Feb 15 15:35:26 2026
    From Newsgroup: comp.arch

    On Sun, 15 Feb 2026 03:58:28 -0500
    Robert Finch <robfi680@gmail.com> wrote:

    On 2026-02-14 4:57 a.m., Thomas Koenig wrote:
    Comments?

    I wonder how difficult it would be to implement a parallel tester in
    an FPGA. It looks simple enough and there are hundreds of DSP blocks available to use. Could test in blocks of hundreds of numbers at a
    time. Running at 100 MHz * 200 tests at a time = how long to get to
    2^71?


    I am not totally sure what exactly you are asking, but if you are
    asking how much time it takes at that rate to do 2**71 iterations,
    then the answer is ~3743 years.
    But you probably already figured it out yourself.
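
    The figure checks out (a quick sanity check, my arithmetic, using
    365-day years):

```python
# 100 MHz times 200 parallel tests, one iteration per cycle each.
rate = 100e6 * 200                    # iterations per second
seconds = 2.0**71 / rate
years = seconds / (365 * 24 * 3600)
print(round(years))                   # ~3744, matching the figure above
```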

    Of course, a brute-force demonstration of the correctness of the
    conjecture for all numbers in the range 1 to 2**71 will take more
    steps than 2**71. And it is unknown how much more.

    I wonder if the calculation can be broken down further. Numbers are
    all just defined by rules. I wonder if some calculus may apply. Limit
    of function approaches one for a finite number of steps. There may be
    a whole range of algorithms, is there one for the number two? Then
    the number three?

    Occasionally I have made useless machinery. Great puzzle work. Made
    the game of life in hardware.



    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Feb 15 15:46:27 2026
    From Newsgroup: comp.arch

    David Brown <david.brown@hesbynett.no> schrieb:
    On 15/02/2026 09:58, Robert Finch wrote:

    I wonder how difficult it would be to implement a parallel tester in an
    FPGA. It looks simple enough and there are hundreds of DSP blocks
    available to use. Could test in blocks of hundreds of numbers at a time.
    Running at 100 MHz * 200 tests at a time = how long to get to 2^71?


    One of the nice things about this is that you don't need DSP blocks - it
    all works well with very simple operations well-suited to an FPGA. Your calculation blocks would hold an n-bit number (wider than the number you
    are testing for, since rounds of the Collatz function can grow a lot)
    and a counter. Initialise the block to the number you want to test.
    For each round, check the LSB. If it is 0, right-shift your register.
    If it is 1, replace "x" with "(x << 1) + x + 1". That's just some
    simple logic - no DSP blocks involved. If your register or your counter
    overflow, you've found an "interesting" number and can pass that out to
    a PC to test it fully.
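    In software, one round of such a calculation block might look like this
    (a sketch; the checked arithmetic stands in for the register overflow
    detection described above):

```rust
/// One round of a calculation block: an n-bit register plus an
/// iteration counter. Returns false when the value would overflow,
/// i.e. an "interesting" number to hand to a PC for full testing.
/// (Software sketch; the hardware version is just a shifter and an
/// adder plus overflow detection.)
fn round(x: &mut u128, count: &mut u32) -> bool {
    *count += 1;
    if *x & 1 == 0 {
        *x >>= 1;                              // even: right-shift
        true
    } else {
        // odd: x = 3x + 1 = (x << 1) + x + 1, with overflow check
        match x.checked_mul(3).and_then(|y| y.checked_add(1)) {
            Some(y) => { *x = y; true }
            None => false,
        }
    }
}

/// Run rounds until the value reaches 1; returns the step count.
fn steps_to_one(mut x: u128) -> u32 {
    let mut count = 0;
    while x != 1 {
        assert!(round(&mut x, &mut count), "overflowed u128");
    }
    count
}

fn main() {
    assert_eq!(steps_to_one(27), 111);  // the classic long trajectory
    assert_eq!(steps_to_one(6), 8);
}
```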

    It is even simpler than that - using the optimization of always
    calculating (3*n+1)/2 is ideal for look-up tables. If you have
    six-bit LUTs, you divide your number into five-bit blocks and
    set up your LUTs for generating (3*n+1+carry_in)/2 - five LUTs
    for five result bits. You then also need two LUTs for generating
    the propagate and generate bits for the five-bit groups.

    You can then generate group generate and propagate bits for three
    five-bit groups with two LUTs (15 bits) and so on. Three levels
    of P and G would give you 135 bits, more than enough.
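    A behavioural model of the grouped generate/propagate idea (illustrative
    only, not the exact LUT mapping above): for odd n, (3n+1)/2 is the
    addition n + (n>>1) + 1, split into 5-bit groups whose carries are
    resolved separately from per-group P and G bits.

```rust
/// For odd n, compute (3n+1)/2 as the addition n + (n>>1) + 1 in
/// 5-bit groups: each group precomputes its sum for both possible
/// carry-ins plus generate/propagate bits; the group carries are
/// then resolved on their own.
fn grouped_add(n: u64) -> u64 {
    assert!(n & 1 == 1 && n < 1u64 << 58);
    let (a, b) = (n, n >> 1);
    let groups = 12;                    // 12 * 5 = 60 result bits
    let (mut s0, mut s1) = ([0u64; 12], [0u64; 12]);
    let (mut gen, mut prop) = ([false; 12], [false; 12]);
    for g in 0..groups {
        let ag = (a >> (5 * g)) & 0x1f;
        let bg = (b >> (5 * g)) & 0x1f;
        let t0 = ag + bg;               // group sum, carry-in = 0
        let t1 = ag + bg + 1;           // group sum, carry-in = 1
        s0[g] = t0 & 0x1f;
        s1[g] = t1 & 0x1f;
        gen[g] = t0 >> 5 == 1;          // carries out regardless of cin
        prop[g] = t1 >> 5 == 1;         // carries out when cin = 1
    }
    // Resolve group carries (rippled here; hardware would use a tree).
    let mut y = 0u64;
    let mut carry = true;               // the "+1" enters at bit 0
    for g in 0..groups {
        y |= (if carry { s1[g] } else { s0[g] }) << (5 * g);
        carry = gen[g] || (prop[g] && carry);
    }
    y
}

fn main() {
    for n in (1u64..100_000).step_by(2) {
        assert_eq!(grouped_add(n), (3 * n + 1) / 2);
    }
}
```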

    The comparators could be more expensive; if this is done with LUTs,
    you could only compare three bits each. Maybe this is better
    with the DSP blocks, I don't know.

    There's lots of scope for both high-level work (the housekeeping and tracking of operations for each of your worker elements) and low-level
    work (getting the carry operations to work at high speed).

    Yep.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Feb 15 19:19:16 2026
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> schrieb:

    "Unknown" is a theoretically correct answer, but I can provide a very
    educated guess that on average it would take ~500 steps and the worst
    case would not exceed 2000 steps.
    So, the bad news is that overall it will take more than a million
    years on your machine. And the good news is that it'll take less
    than 2 million years.

    Not as bad as that; many simplifications are possible.

    For one, you can discount all even numbers as starting values.
    You can go further: if you look at residues modulo 2^n of your
    starting numbers, there are only A076227(n) "interesting" remainders
    that you need to look at. If you take n=39, only 3611535862 of the
    2^39 residues need to be looked at, a reduction by a factor of
    around 150. You can do some more reductions (only 2/3 of those
    numbers need checking if you can calculate the remainder modulo 3
    fast). There are different tradeoffs for different n, of course
    (a factor of 16 for n=10, for example).
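    The residue sieve behind those counts is easy to reproduce (a sketch;
    the small counts below were checked by hand, and the n=10 case shows
    the factor-16 reduction mentioned above):

```rust
/// Count residues r mod 2^n that survive n steps of the map T
/// (odd: (3x+1)/2, even: x/2), tracking x = 2^n * k + r as the
/// affine form a*k + b (the parity is fixed by b). A residue is
/// discarded as soon as a < 2^n: from then on x sits below its
/// starting value for all large k, so smaller numbers cover it.
fn surviving_residues(n: u32) -> u64 {
    let m = 1u64 << n;
    let mut count = 0;
    for r in 0..m {
        let (mut a, mut b) = (m, r);
        let mut survives = true;
        for _ in 0..n {
            if b & 1 == 1 {
                a = 3 * a / 2;          // a stays even until the last step
                b = (3 * b + 1) / 2;
            } else {
                a /= 2;
                b /= 2;
            }
            if a < m {
                survives = false;
                break;
            }
        }
        if survives { count += 1 }
    }
    count
}

fn main() {
    // First few counts: 1, 1, 2, 3, 4, 8, ... (cf. OEIS A076227)
    let counts: Vec<u64> = (1..=6).map(surviving_residues).collect();
    assert_eq!(counts, vec![1, 1, 2, 3, 4, 8]);
    // n = 10: 64 of 1024 residues survive -- the factor-16 reduction.
    assert_eq!(surviving_residues(10), 64);
}
```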

    The main key is parallelization.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Feb 15 19:42:47 2026
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    Just thinking about a way to replace ~N iterations by a MUL, ADD and SHR

    Even with 128-bit numbers, I would only need two MULs, two ADD/ADC pairs
    and a SHRD/SHR, so maybe 8-10 cycles?

    https://link.springer.com/article/10.1007/s11227-025-07337-0 gives
    an improved algorithm, which they are using for the supercomputer
    version. They are doing:

    n = n0
    repeat
        n = n + 1
        alpha = ctz(n)
        n = n * 3^alpha / 2^alpha
        n = n - 1
        beta = ctz(n)
        n = n / 2^beta
    until n < n_0

    They are also playing some strange tricks with parallel solving that
    are not described clearly, or I am not clever enough to understand
    what they are doing. I would have to look at their code, I guess.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sun Feb 15 22:40:22 2026
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-02-15 7:14 a.m., David Brown wrote:
    On 15/02/2026 09:58, Robert Finch wrote:

    I wonder how difficult it would be to implement a parallel tester in
    an FPGA. It looks simple enough and there are hundreds of DSP blocks
    available to use. Could test in blocks of hundreds of numbers at a time.
    Running at 100 MHz * 200 tests at a time = how long to get to 2^71?


    One of the nice things about this is that you don't need DSP blocks - it
    Yeah, I figured that out after coding. 3n+1 = n+n+n+1

    3n+1 = (n<<1)+n+1

    one add with carry in and n gets wired in directly and also shifted
    up by 1 (no logic shifter).

    Even better:

    (3n+1)/2 = n + (n>>1) + 1

    I.e. two iterations for the cost of one.
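    A quick exhaustive check of that identity for odd n:

```rust
// For odd n, one 3n+1 step plus the guaranteed halving fuse into a
// single add: (3n+1)/2 == n + (n >> 1) + 1.
fn fused_step(n: u64) -> u64 {
    debug_assert!(n & 1 == 1);
    n + (n >> 1) + 1
}

fn main() {
    for n in (1u64..1_000_000).step_by(2) {
        assert_eq!(fused_step(n), (3 * n + 1) / 2);
    }
}
```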

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sun Feb 15 22:59:19 2026
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    Just thinking about a way to replace ~N iterations by a MUL, ADD and SHR

    Even with 128-bit numbers, I would only need two MULs, two ADD/ADC pairs
    and a SHRD/SHR, so maybe 8-10 cycles?

    https://link.springer.com/article/10.1007/s11227-025-07337-0 gives
    an improved algorithm, which they are using for the supercomputer
    version. They are doing:

    n = n0
    repeat
    n = n + 1
    alpha = ctz(n)
    n = n * 3^alpha / 2^alpha

    Strange syntax...

    ctz(n) must count trailing zeroes, right?
    If so, then ctz(n+1) will return 1 or more for odd n, so that gives
    3n+3, which after the division by 2 returns a value which is one too high.

    n = n - 1

    but this is corrected here!

    The key must be that the multiplication by a power of three also works
    for n with multiple trailing 1 bits.

    The "x/2^alpha" part is of course just x>>alpha.

    For an even n they still do a multiplication by 1 and a SHR 0, but the
    code is branchless.

    beta = ctz(n)
    n = n / 2^beta
    until n < n_0

    I am guessing that the best way to do the n*3^alpha parts is a table of
    powers of 3, so

    let mut n = n0;
    loop {
        n += 1;
        let alpha = ctz(n);
        n = (n * power_of_three[alpha]) >> alpha;
        n -= 1;
        let beta = ctz(n);
        n >>= beta;
        if n < n0 { break };
    }
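    A runnable version of that sketch (with ctz spelled as Rust's
    trailing_zeros) can be checked against the naive iteration. One subtle
    point: the two loops do not stop on exactly the same value, because the
    accelerated form shifts out all trailing zeros, so it lands on the odd
    value at the end of the halving run in which the naive loop first drops
    below n0.

```rust
/// One pass of the paper's accelerated descent: a run of odd steps
/// (via the trailing zeros of n+1 and a power-of-three multiply)
/// followed by a run of halvings, until n drops below the start n0.
fn accelerated_min(n0: u128) -> u128 {
    let mut pow3 = [1u128; 81];                  // 3^0 .. 3^80 fits in u128
    for i in 1..81 { pow3[i] = 3 * pow3[i - 1]; }
    let mut n = n0;
    loop {
        n += 1;
        let alpha = n.trailing_zeros() as usize;
        n = (n >> alpha) * pow3[alpha];          // n * 3^alpha / 2^alpha
        n -= 1;
        n >>= n.trailing_zeros();
        if n < n0 { return n; }
    }
}

/// Naive reference: apply the Collatz map until the value drops below n0.
fn naive_min(n0: u128) -> u128 {
    let mut n = n0;
    loop {
        n = if n & 1 == 1 { 3 * n + 1 } else { n / 2 };
        if n < n0 { return n; }
    }
}

fn main() {
    for n0 in (3u128..20_001).step_by(2) {
        let v = naive_min(n0);                   // may still be even
        assert_eq!(accelerated_min(n0), v >> v.trailing_zeros());
    }
}
```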


    They are also playing some strange tricks with parallel solving that
    are not described clearly, or I am not clever enough to understand
    what they are doing. I would have to look at their code, I guess.

    :-)

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Feb 15 22:41:40 2026
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> writes:
    I wonder how difficult it would be to implement a parallel tester in an
    FPGA. It looks simple enough and there are hundreds of DSP blocks
    available to use. Could test in blocks of hundreds of numbers at a time.
    Running at 100 MHz * 200 tests at a time = how long to get to 2^71?

    OTOH, with a PC with 8 cores running at 4GHZ, even with scalar
    operations you might outperform that. And with SIMD probably even
    more. The main problem is that support for 128-bit integers is not so
    great for scalar instructions and worse for SIMD instructions.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Feb 15 21:53:10 2026
    From Newsgroup: comp.arch

    On 2026-02-15 4:59 p.m., Terje Mathisen wrote:
    Thomas Koenig wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    Just thinking about a way to replace ~N iterations by a MUL, ADD and SHR

    Even with 128-bit numbers, I would only need two MULs, two ADD/ADC pairs
    and a SHRD/SHR, so maybe 8-10 cycles?

    https://link.springer.com/article/10.1007/s11227-025-07337-0 gives
    an improved algorithm, which they are using for the supercomputer
    version. They are doing:

    n = n0
    repeat
        n = n + 1
        alpha = ctz(n)
        n = n * 3^alpha / 2^alpha

    Strange syntax...

    ctz(n) must count trailing zeroes, right?
    If so, then ctz(n+1) will return 1 or more for odd n, so that gives
    3n+3, which after the division by 2 returns a value which is one too high.

        n = n - 1

    but this is corrected here!

    The key must be that the multiplication by a power of three also works
    for n with multiple trailing 1 bits.

    The "x/2^alpha" part is of course just x>>alpha.

    For an even n they still do a multiplication by 1 and a SHR 0, but the
    code is branchless.

        beta = ctz(n)
        n = n / 2^beta
    until n < n_0

    I am guessing that the best way to do the n*3^alpha parts is a table of powers of 3, so

    let mut n = n0;
    loop {
        n += 1;
        let alpha = ctz(n);
        n = (n * power_of_three[alpha]) >> alpha;
        n -= 1;
        let beta = ctz(n);
        n >>= beta;
        if n < n0 { break };
    }


    They are also playing some strange tricks with parallel solving that
    are not described clearly, or I am not clever enough to understand
    what they are doing. I would have to look at their code, I guess.

    :-)

    Terje

    How many iterations are done in one loop? I think it would need to be
    at least 10 on average to beat the simple approach, since the simple
    approach takes only one clock per iteration. The 128-bit multiply and
    128-bit count trailing zeros are slow in an FPGA.

    And, can the approach consume a more limited number of iterations? For
    example counting only the 6 LSB zeros?

    I found a mistake in my code handling groups. It can only do about a
    24x24=576 matrix of calculations in parallel @ 200MHz. So that's 1/4
    million years?




    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Mon Feb 16 12:31:31 2026
    From Newsgroup: comp.arch

    On Sun, 15 Feb 2026 19:19:16 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Michael S <already5chosen@yahoo.com> schrieb:

    "Unknown" is a correct theoretically answer, but I can provide a
    very educated guess that on average it would take ~500 steps and
    the worst case would not exceed 2000 steps.
    So, bad news are that overall it will take more than a million
    years on your machine. And good news are that it'll take less than
    2 million years.

    Not as bad, very many simplifications are possible.

    For one, you can discount all even numbers as starting values.

    O.k.

    You can go further: If you look at residues modulo 2^n of your
    starting numbers, there are only A076227(n) "interesting" remainders
    that you need to look at.

    Can not say that I understood.

    If you take n=39, only 3611535862 cases
    need to be looked at, so only 3611535862/2^39 of all cases need to
    be looked at, a reduction of a factor around 150. You can also
    do some more reductions (only 2/3 of those numbers need checking
    if you can calculate the remainder by 3 fast). There are different
    tradeoffs for different n, of course (a factor of 16 for n=10,
    for example).



    The estimate above was done for a loop that finishes at 1.
    For a brute-force disproof of the Collatz conjecture that is
    unnecessary: we can stop as soon as n < n0. I think you mentioned it
    in your original post, but I didn't pay attention.
    Due to the early exit, the mean number of iterations is cut from
    ~500 to 7, or at worst 8.

    The main key is parallelization.

    And here is a problem.
    If one wants to finish in, say, 1 month, one would need a LOT of
    parallelization. Even after all improvements we are talking about
    several million testing nodes. At such a scale I can think of two
    main problems:
    1. The dispatcher becomes a bottleneck.
    2. Some nodes will fail. That has to be detected and their jobs
    re-allocated, which further complicates the scheduler.

    It does not sound particularly hard if somebody pays for the
    engineering, especially if the whole job is done in one dedicated
    data center. Otherwise it does not sound easy.

    The worst part is that there is nothing special about 2**71.
    Or about 2**80, for that matter. Tim Reintch already said that better
    than I can. The machine is not merely useless, it is pointless.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Mon Feb 16 12:40:54 2026
    From Newsgroup: comp.arch

    On Sun, 15 Feb 2026 21:53:10 -0500
    Robert Finch <robfi680@gmail.com> wrote:
    On 2026-02-15 4:59 p.m., Terje Mathisen wrote:
    Thomas Koenig wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    Just thinking about a way to replace ~N iterations by a MUL, ADD
    and SHR

    Even with 128-bit numbers, I would only need two MULs, two
    ADD/ADC pairs and a SHRD/SHR, so maybe 8-10 cycles?

    https://link.springer.com/article/10.1007/s11227-025-07337-0 gives
    an improved algorithm, which they are using for the supercomputer
    version. They are doing:

    n = n0
    repeat
        n = n + 1
        alpha = ctz(n)
        n = n * 3^alpha / 2^alpha

    Strange syntax...

    ctz(n) must count trailing zeroes, right?
    If so, then ctz(n+1) will return 1 or more for odd n, so that gives
    3n+3, which after the division by 2 returns a value which is one
    too high.
        n = n - 1

    but this is corrected here!

    The key must be that the multiplication by a power of three also
    works for n with multiple trailing 1 bits.

    The "x/2^alpha" part is of course just x>>alpha.

    For an even n they still do a multiplication by 1 and a SHR 0, but
    the code is branchless.

        beta = ctz(n)
        n = n / 2^beta
    until n < n_0

    I am guessing that the best way to do the n*3^alpha parts is a
    table of powers of 3, so

    let mut n = n0;
    loop {
        n += 1;
        let alpha = ctz(n);
        n = (n * power_of_three[alpha]) >> alpha;
        n -= 1;
        let beta = ctz(n);
        n >>= beta;
        if n < n0 { break };
    }


    They are also playing some strange tricks with parallel solving
    that are not described clearly, or I am not clever enough to
    understand what they are doing. I would have to look at their
    code, I guess.
    :-)

    Terje

    How many iterations are done in one loop?
    Worst case: a few thousand, or less; I haven't yet encountered 1000.
    Mean: 6 or 7. Maybe 8, but that's unlikely.
    I think it would need to be
    at least 10 on average to beat the simple approach. The simple
    approach only taking one clock per iteration. The 128-bit multiply
    and 128-bit count trailing zeros are slow in an FPGA.

    And, can the approach consume a more limited number of iterations?
    For example counting only the 6 LSB zeros?

    I found a mistake in my code handling groups. It can only do about a 24x24=576 matrix of calculations in parallel @ 200MHz. So that's 1/4
    million years?

    Much less.
    My 1st estimate was for a more stupid loop that always counts down to 1.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Mon Feb 16 16:08:35 2026
    From Newsgroup: comp.arch

    Robert Finch wrote:
    On 2026-02-15 4:59 p.m., Terje Mathisen wrote:
    Thomas Koenig wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    Just thinking about a way to replace ~N iterations by a MUL, ADD and
    SHR

    Even with 128-bit numbers, I would only need two MULs, two ADD/ADC
    pairs
    and a SHRD/SHR, so maybe 8-10 cycles?

    https://link.springer.com/article/10.1007/s11227-025-07337-0 gives
    an improved algorithm, which they are using for the supercomputer
    version. They are doing:

    n = n0
    repeat
        n = n + 1
        alpha = ctz(n)
        n = n * 3^alpha / 2^alpha

    Strange syntax...

    ctz(n) must count trailing zeroes, right?
    If so, then ctz(n+1) will return 1 or more for odd n, so that gives
    3n+3, which after the division by 2 returns a value which is one too
    high.

        n = n - 1

    but this is corrected here!

    The key must be that the multiplication by a power of three also works
    for n with multiple trailing 1 bits.

    The "x/2^alpha" part is of course just x>>alpha.

    For an even n they still do a multiplication by 1 and a SHR 0, but the
    code is branchless.

        beta = ctz(n)
        n = n / 2^beta
    until n < n_0

    I am guessing that the best way to do the n*3^alpha parts is a table
    of powers of 3, so

    let mut n = n0;
    loop {
        n += 1;
        let alpha = ctz(n);
        n = (n * power_of_three[alpha]) >> alpha;
        n -= 1;
        let beta = ctz(n);
        n >>= beta;
        if n < n0 { break };
    }


    They are also playing some strange tricks with parallel solving that
    are not described clearly, or I am not clever enough to understand
    what they are doing. I would have to look at their code, I guess.

    :-)

    Terje

    How many iterations are done in one loop? I think it would need to be
    at least 10 on average to beat the simple approach, since the simple
    approach takes only one clock per iteration. The 128-bit multiply and
    128-bit count trailing zeros are slow in an FPGA.
    Right.
    The first part will handle trailing strings of 1's, while the second
    part does the same for all trailing zeroes after that initial mul_by_power_of_three.

    And, can the approach consume a more limited number of iterations? For
    example counting only the 6 LSB zeros?
    It should work directly if you cap the shift count, i.e. with a
    min(ctz, N) wrapper.

    I found a mistake in my code handling groups. It can only do about a
    24x24=576 matrix of calculations in parallel @ 200MHz. So that's 1/4
    million years?
    FPGAs typically run at much, much lower frequencies than modern CPUs, so if the code above can average 3-5 iterations per loop, and do it in
    around 10 clock cycles, then the FPGA will be hard pressed to keep up?
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Mon Feb 16 07:50:49 2026
    From Newsgroup: comp.arch

    On 2/14/2026 1:57 AM, Thomas Koenig wrote:
    Some people are throwing large amounts of computing power at trying
    to disprove the Collatz conjecture. Graphics cards have sped this
    up enormously.

    big snip


    The number of iterations would be tracked; if larger than a prevous
    record, the calculation would terminate with a corresponding signal
    to the outside world.

    Also, each cycle, the result from the previous round would be
    compared against the input value. If it is lower, the next number
    would be chosen. Finally, the number reached would be checked against
    the previous, pre-set maximum, with would also signal to the outside
    world.

    Hold these thoughts

    snip

    The problem is embarrassingly parallel.


    Sort of. You could do it that way, but you would be giving up some performance. Specifically, as you say, you want to check the number
    against the previous input value and the preset maximum. But in a
    parallel implementation, you want to check against the largest value
    checked *across all the parallel threads*. In order to do that, the
    threads need to communicate, thus violating the "embarrassingly
    parallel" rule. I don't know how much performance you would give up so
    it may not make much difference, nor do I know the cost of keeping some "global" largest value checked so far. There is a tradeoff here.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Mon Feb 16 18:31:46 2026
    From Newsgroup: comp.arch

    On Mon, 16 Feb 2026 07:50:49 -0800
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:

    On 2/14/2026 1:57 AM, Thomas Koenig wrote:
    Some people are throwing large amounts of computing power at trying
    to disprove the Collatz conjecture. Graphics cards have sped this
    up enormously.

    big snip


    The number of iterations would be tracked; if larger than a prevous
    record, the calculation would terminate with a corresponding signal
    to the outside world.

    Also, each cycle, the result from the previous round would be
    compared against the input value. If it is lower, the next number
    would be chosen. Finally, the number reached would be checked
    against the previous, pre-set maximum, with would also signal to
    the outside world.

    Hold these thoughts

    snip

    The problem is embarrassingly parallel.


    Sort of. You could do it that way, but you would be giving up some performance. Specifically, as you say, you want to check the number
    against the previous input value and the preset maximum. But in a
    parallel implementation, you want to check against the largest value
    checked *across all the parallel threads*. In order to do that, the
    threads need to communicate, thus violating the "embarrassingly
    parallel" rule. I don't know how much performance you would give up
    so it may not make much difference,

    If the initial values N0 are above 2**64, and the number of tests NP
    that run in parallel is a few million or even a few hundred million,
    then the performance gain achieved by comparing against N0+NP instead
    of against N0 is negligible.

    nor do I know the cost of keeping
    some "global" largest value checked so far. There is a tradeoff here.


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Mon Feb 16 20:34:44 2026
    From Newsgroup: comp.arch

    On Mon, 16 Feb 2026 12:40:54 +0200
    Michael S <already5chosen@yahoo.com> wrote:
    On Sun, 15 Feb 2026 21:53:10 -0500
    Robert Finch <robfi680@gmail.com> wrote:

    On 2026-02-15 4:59 p.m., Terje Mathisen wrote:
    Thomas Koenig wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    Just thinking about a way to replace ~N iterations by a MUL, ADD
    and SHR

    Even with 128-bit numbers, I would only need two MULs, two
    ADD/ADC pairs and a SHRD/SHR, so maybe 8-10 cycles?

    https://link.springer.com/article/10.1007/s11227-025-07337-0
    gives an improved algorithm, which they are using for the
    supercomputer version. They are doing:

    n = n0
    repeat
        n = n + 1
        alpha = ctz(n)
        n = n * 3^alpha / 2^alpha

    Strange syntax...

    ctz(n) must count trailing zeroes, right?
    If so, then ctz(n+1) will return 1 or more for odd n, so that
    gives 3n+3, which after the division by 2 returns a value which
    is one too high.
        n = n - 1

    but this is corrected here!

    The key must be that the multiplication by a power of three also
    works for n with multiple trailing 1 bits.

    The "x/2^alpha" part is of course just x>>alpha.

    For an even n they still do a multiplication by 1 and a SHR 0, but
    the code is branchless.

        beta = ctz(n)
        n = n / 2^beta
    until n < n_0

    I am guessing that the best way to do the n*3^alpha parts is a
    table of powers of 3, so

    let mut n = n0;
    loop {
        n += 1;
        let alpha = ctz(n);
        n = (n * power_of_three[alpha]) >> alpha;
        n -= 1;
        let beta = ctz(n);
        n >>= beta;
        if n < n0 { break };
    }


    They are also playing some strange tricks with parallel solving
    that are not described clearly, or I am not clever enough to
    understand what they are doing. I would have to look at their
    code, I guess.
    :-)

    Terje

    How many iterations are done in one loop?

    Worst case: a few thousand, or less; I haven't yet encountered 1000.
    Mean: 6 or 7. Maybe 8, but that's unlikely.

    The above mean value is for completely unfiltered inputs.
    If we apply a very basic filter, only looking at inputs of the form
    4*x+3, then the average # of steps is bigger - somewhere in the
    range [15:20]. So, for filtered inputs, higher-order steps start to
    make sense. Or not. It strongly depends on the specific FPGA model.
    I think it would need to be
    at least 10 on average to beat the simple approach. The simple
    approach only taking one clock per iteration. The 128-bit multiply
    and 128-bit count trailing zeros are slow in an FPGA.

    And, can the approach consume a more limited number of iterations?
    For example counting only the 6 LSB zeros?

    I found a mistake in my code handling groups. It can only do about
    a 24x24=576 matrix of calculations in parallel @ 200MHz. So that's
    1/4 million years?


    Much less.
    My 1st estimate was for a more stupid loop that always counts down to 1.



    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Feb 17 15:41:23 2026
    From Newsgroup: comp.arch

    On Mon, 16 Feb 2026 12:31:31 +0200
    Michael S <already5chosen@yahoo.com> wrote:

    On Sun, 15 Feb 2026 19:19:16 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Michael S <already5chosen@yahoo.com> schrieb:

    "Unknown" is a correct theoretically answer, but I can provide a
    very educated guess that on average it would take ~500 steps and
    the worst case would not exceed 2000 steps.
    So, bad news are that overall it will take more than a million
    years on your machine. And good news are that it'll take less than
    2 million years.

    Not as bad, very many simplifications are possible.

    For one, you can discount all even numbers as starting values.

    O.k.

    You can go further: If you look at residues modulo 2^n of your
    starting numbers, there are only A076227(n) "interesting" remainders
    that you need to look at.

    Can not say that I understood.

    If you take n=39, only 3611535862 cases
    need to be looked at, so only 3611535862/2^39 of all cases need to
    be looked at, a reduction of a factor around 150. You can also
    do some more reductions (only 2/3 of those numbers need checking
    if you can calculate the remainder by 3 fast). There are different tradeoffs for different n, of course (a factor of 16 for n=10,
    for example).



    Hopefully now I have figured out what you are talking about, except
    for the 2/3 part and the meaning of A076227(n).
    I experimented a little with these ideas and it seems that the
    reduction in the number of steps is nowhere near the reduction in
    the number of tested cases.
    My test vectors were 2**32 consecutive points starting at 2**52 or
    2**53. I tried different modulo filters in the range from 2**5 to
    2**31. Two modes of filtering were used.
    In "simple" mode I just filtered out "non-interesting" inputs, but
    did nothing special about "interesting" inputs.
    In "advanced" mode I accelerated processing of "interesting" inputs
    by starting their testing from the point x0 = (x/M)*A[x%M] + B[x%M],
    where the tables A[] and B[] were pre-calculated during the filter
    build step. Obviously, "advanced" mode filtering requires ~3 times
    more memory, plus a multiplier, plus some logic.
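    A sketch of such a filter build, under the assumption that A[] and B[]
    have the natural affine-sieve semantics consistent with the x0 formula
    above: for each surviving residue m mod M = 2^n, n steps of the map T
    send M*k + m to A[m]*k + B[m], and eliminated residues get no entry.

```rust
/// Build the assumed "advanced" filter tables: for each residue r mod
/// M = 2^n, track M*k + r as the affine form a*k + b through n steps
/// of T (odd: (3x+1)/2, even: x/2). Residues whose coefficient drops
/// below M are filtered out (None); survivors get their (A, B) pair.
fn build_filter(n: u32) -> Vec<Option<(u64, u64)>> {
    let m = 1u64 << n;
    (0..m).map(|r| {
        let (mut a, mut b) = (m, r);
        for _ in 0..n {
            if b & 1 == 1 {
                a = 3 * a / 2;
                b = (3 * b + 1) / 2;
            } else {
                a /= 2;
                b /= 2;
            }
            if a < m { return None }    // drops below start: filtered
        }
        Some((a, b))
    }).collect()
}

fn t_step(x: u64) -> u64 {
    if x & 1 == 1 { (3 * x + 1) / 2 } else { x / 2 }
}

fn main() {
    let n = 10;
    let m = 1u64 << n;
    let filter = build_filter(n);
    // Only the "interesting" residues get a table entry.
    assert_eq!(filter.iter().flatten().count(), 64);
    // For surviving residues, the table jump equals n naive T-steps.
    for x in (1u64 << 40)..(1u64 << 40) + 4096 {
        if let Some((a, b)) = filter[(x % m) as usize] {
            assert_eq!((x / m) * a + b, (0..n).fold(x, |v, _| t_step(v)));
        }
    }
}
```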

    Here are my results:
    $ ./nsteps2 4503599627370496 0x100000000
    Initial value = 4503599627370496 (0010000000000000).
    #values = 4294967296 (0000000100000000)
    Unfiltered : 22501638882 steps. 4294967296 tests. 5.239 steps/iter
    n*4+3 : 17132929762 steps. 1073741824 tests. 15.956 steps/iter
    n*2**5+m : 15924970210 steps. 939524096 tests. 16.950 steps/iter
    n*2**5+m * : 8408777442 steps. 939524096 tests. 8.950 steps/iter
    n*2**10+m : 11948770018 steps. 503316480 tests. 23.740 steps/iter
    n*2**10+m * : 7088288929 steps. 503316480 tests. 14.083 steps/iter
    n*2**15+m : 9003713250 steps. 289013760 tests. 31.153 steps/iter
    n*2**15+m * : 4818894105 steps. 289013760 tests. 16.674 steps/iter
    n*2**20+m : 6973854434 steps. 181379072 tests. 38.449 steps/iter
    n*2**20+m * : 3496505674 steps. 181379072 tests. 19.277 steps/iter
    n*2**25+m : 5574414306 steps. 125809408 tests. 44.308 steps/iter
    n*2**25+m * : 2543845543 steps. 125809408 tests. 20.220 steps/iter
    n*2**29+m : 4660796778 steps. 95710352 tests. 48.697 steps/iter
    n*2**29+m * : 2008917951 steps. 95710352 tests. 20.990 steps/iter
    n*2**30+m : 4493358006 steps. 89757100 tests. 50.061 steps/iter
    n*2**30+m * : 1915335033 steps. 89757100 tests. 21.339 steps/iter
    n*2**31+m : 4212222570 steps. 81419608 tests. 51.735 steps/iter
    n*2**31+m * : 1780554493 steps. 81419608 tests. 21.869 steps/iter

    $ ./nsteps2 0x20000000000000 0x100000000
    Initial value = 9007199254740992 (0020000000000000).
    #values = 4294967296 (0000000100000000)
    Unfiltered : 22501540484 steps. 4294967296 tests. 5.239 steps/iter
    n*4+3 : 17132831364 steps. 1073741824 tests. 15.956 steps/iter
    n*2**5+m : 15924871812 steps. 939524096 tests. 16.950 steps/iter
    n*2**5+m * : 8408679044 steps. 939524096 tests. 8.950 steps/iter
    n*2**10+m : 11948671620 steps. 503316480 tests. 23.740 steps/iter
    n*2**10+m * : 7088271089 steps. 503316480 tests. 14.083 steps/iter
    n*2**15+m : 9003614852 steps. 289013760 tests. 31.153 steps/iter
    n*2**15+m * : 4818766141 steps. 289013760 tests. 16.673 steps/iter
    n*2**20+m : 6973756036 steps. 181379072 tests. 38.449 steps/iter
    n*2**20+m * : 3496390679 steps. 181379072 tests. 19.277 steps/iter
    n*2**25+m : 5574315908 steps. 125809408 tests. 44.308 steps/iter
    n*2**25+m * : 2543741535 steps. 125809408 tests. 20.219 steps/iter
    n*2**29+m : 4660698380 steps. 95710352 tests. 48.696 steps/iter
    n*2**29+m * : 2008935493 steps. 95710352 tests. 20.990 steps/iter
    n*2**30+m : 4493259608 steps. 89757100 tests. 50.060 steps/iter
    n*2**30+m * : 1915241920 steps. 89757100 tests. 21.338 steps/iter
    n*2**31+m : 4212124172 steps. 81419608 tests. 51.734 steps/iter
    n*2**31+m * : 1780581916 steps. 81419608 tests. 21.869 steps/iter
    'Advanced' mode marked with '*'

    As you can see, the 2**10 filter reduces the # of processed points
    by a factor of 8.5, but the number of processing steps is reduced
    only 1.9x in simple mode and 3.2x in advanced mode.
    For the 2**20 filter the ratios are, respectively, 23.7, 3.2 and 6.4.
    For the 2**31 filter, 52.7, 5.3 and 12.6.

    Note that for implementation in an FPGA, even filtering modulo 2**29
    (~12M entries) is impractical. Assuming advanced mode, 12M entries
    occupy 1000 Mbits. In an inexpensive FPGA that simply does not fit.
    In a very expensive FPGA it fits, but barely.

    Let's take, for example, the Altera Stratix-10 GX 2800. By the
    standards of the previous decade it is a real behemoth, with 2.75M
    Logic Elements (LEs). Assuming that each tester unit occupies 500
    LEs, there is room for 5500 units. Looking at my tables above, you
    want 1 filter per ~20 tester units, so 275 filters are needed. And
    how many are available? Just two!

    For modulo 2**20 it's a little better. Each filter consists of
    ~45,000 entries and occupies ~2.7 Mbits, so you can fit 80+ filters
    in the GX 2800. But you want 5500 / 19 = 290, so it is still
    unbalanced.

    For modulo 2**15 (2205 entries, ~100 Kbits) there is enough memory
    to fit the desired number of replicas. But a small filter like that
    is significantly less effective, reducing the number of steps (in
    advanced mode) only by 4.7x.

    It is possible that there exist smarter algorithms for partitioning
    the work between parallel tester units that could greatly reduce the
    need for access to the filtering tables relative to the simple
    scheme I assumed in the estimates above. But I think the point
    stands - this sort of filtering is not a silver bullet. It can
    improve throughput, but it is unlikely to improve it by more than
    one order of magnitude.









    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Feb 17 18:17:10 2026
    From Newsgroup: comp.arch

    On Tue, 17 Feb 2026 15:41:23 +0200
    Michael S <already5chosen@yahoo.com> wrote:

    On Mon, 16 Feb 2026 12:31:31 +0200
    Michael S <already5chosen@yahoo.com> wrote:

    On Sun, 15 Feb 2026 19:19:16 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Michael S <already5chosen@yahoo.com> schrieb:

    "Unknown" is a correct theoretically answer, but I can provide a
    very educated guess that on average it would take ~500 steps and
    the worst case would not exceed 2000 steps.
    So, the bad news is that overall it will take more than a million
    years on your machine. And the good news is that it'll take less
    than 2 million years.

    Not that bad - very many simplifications are possible.

    For one, you can discount all even numbers as starting values.

    O.k.

    You can go further: if you look at residues modulo 2^n of your
    starting numbers, there are only A076227(n) "interesting"
    remainders that you need to look at.

    I can't say that I understood.

    If you take n=39, only 3611535862 cases need to be looked at, so
    only 3611535862/2^39 of all cases need to be considered, a
    reduction by a factor of around 150. You can also do some more
    reductions (only 2/3 of those numbers need checking if you can
    calculate the remainder modulo 3 fast). There are different
    tradeoffs for different n, of course (a factor of 16 for n=10, for
    example).



    Hopefully I have now figured out what you are talking about, except
    for the 2/3 part and the meaning of A076227(n).
    I experimented a little with these ideas, and it seems that the
    reduction in the number of steps is nowhere near the reduction in
    the number of tested cases. My test vectors were 2**32 consecutive
    points starting at 2**52 or 2**53. I tried different modulo filters
    in the range from 2**5 to 2**31. Two modes of filtering were used.
    In "simple" mode I just filtered out "non-interesting" inputs, but
    did nothing special about "interesting" inputs.
    In "advanced" mode I accelerated processing of "interesting" inputs
    by starting their testing from the point x0 = (x/M)*A[x%M] + B[x%M],
    where the tables A[] and B[] were pre-calculated during the filter
    build step. Obviously, "advanced" mode filtering requires ~3 times
    more memory, plus a multiplier plus some logic.
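    As a sketch of how such a filter could be built (illustrative
    Python of my own, not the actual nsteps2 code): write a starting
    value as x = q*M + m with M = 2**k, and apply k steps of the compact
    map T(x) = x/2 (x even), (3x+1)/2 (x odd). After j steps containing
    c odd steps, x becomes q*(2**(k-j))*(3**c) + T^j(m), so a residue m
    can be dropped as soon as 3**c < 2**j (the value has fallen below
    its start for every large enough q); each surviving residue gets
    A[m] = 3**c and B[m] = T^k(m).

    ```python
    def build_filter(k):
        """Return {m: (A, B)} for the surviving residues modulo M = 2**k.
        A start value q*M + m reaches q*A + B after k steps of the compact
        map T(x) = x/2 (x even), (3x+1)/2 (x odd)."""
        M = 1 << k
        table = {}
        for m in range(M):
            v, c = m, 0                  # v = T^j(m), c = odd steps so far
            for j in range(1, k + 1):
                if v & 1:
                    v, c = (3 * v + 1) // 2, c + 1
                else:
                    v //= 2
                if 3 ** c < 2 ** j:      # q-coefficient shrank: the value
                    break                # already dropped below its start
            else:
                table[m] = (3 ** c, v)   # A[m] = 3**c, B[m] = T^k(m)
        return table
    ```

    For k = 2 this leaves only the residue 3 with (A, B) = (9, 8) - the
    "n*4+3" row in the tables - and for k = 10 it leaves 64 of 1024
    residues, i.e. a 16x point reduction, with 1295 residues for k = 15.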

    Here are my results:
    $ ./nsteps2 4503599627370496 0x100000000
    Initial value = 4503599627370496 (0010000000000000).
    #values = 4294967296 (0000000100000000)
    Unfiltered : 22501638882 steps. 4294967296 tests. 5.239 steps/iter
    n*4+3 : 17132929762 steps. 1073741824 tests. 15.956 steps/iter
    n*2**5+m : 15924970210 steps. 939524096 tests. 16.950 steps/iter
    n*2**5+m * : 8408777442 steps. 939524096 tests. 8.950 steps/iter
    n*2**10+m : 11948770018 steps. 503316480 tests. 23.740 steps/iter
    n*2**10+m * : 7088288929 steps. 503316480 tests. 14.083 steps/iter
    n*2**15+m : 9003713250 steps. 289013760 tests. 31.153 steps/iter
    n*2**15+m * : 4818894105 steps. 289013760 tests. 16.674 steps/iter
    n*2**20+m : 6973854434 steps. 181379072 tests. 38.449 steps/iter
    n*2**20+m * : 3496505674 steps. 181379072 tests. 19.277 steps/iter
    n*2**25+m : 5574414306 steps. 125809408 tests. 44.308 steps/iter
    n*2**25+m * : 2543845543 steps. 125809408 tests. 20.220 steps/iter
    n*2**29+m : 4660796778 steps. 95710352 tests. 48.697 steps/iter
    n*2**29+m * : 2008917951 steps. 95710352 tests. 20.990 steps/iter
    n*2**30+m : 4493358006 steps. 89757100 tests. 50.061 steps/iter
    n*2**30+m * : 1915335033 steps. 89757100 tests. 21.339 steps/iter
    n*2**31+m : 4212222570 steps. 81419608 tests. 51.735 steps/iter
    n*2**31+m * : 1780554493 steps. 81419608 tests. 21.869 steps/iter

    $ ./nsteps2 0x20000000000000 0x100000000
    Initial value = 9007199254740992 (0020000000000000).
    #values = 4294967296 (0000000100000000)
    Unfiltered : 22501540484 steps. 4294967296 tests. 5.239 steps/iter
    n*4+3 : 17132831364 steps. 1073741824 tests. 15.956 steps/iter
    n*2**5+m : 15924871812 steps. 939524096 tests. 16.950 steps/iter
    n*2**5+m * : 8408679044 steps. 939524096 tests. 8.950 steps/iter
    n*2**10+m : 11948671620 steps. 503316480 tests. 23.740 steps/iter
    n*2**10+m * : 7088271089 steps. 503316480 tests. 14.083 steps/iter
    n*2**15+m : 9003614852 steps. 289013760 tests. 31.153 steps/iter
    n*2**15+m * : 4818766141 steps. 289013760 tests. 16.673 steps/iter
    n*2**20+m : 6973756036 steps. 181379072 tests. 38.449 steps/iter
    n*2**20+m * : 3496390679 steps. 181379072 tests. 19.277 steps/iter
    n*2**25+m : 5574315908 steps. 125809408 tests. 44.308 steps/iter
    n*2**25+m * : 2543741535 steps. 125809408 tests. 20.219 steps/iter
    n*2**29+m : 4660698380 steps. 95710352 tests. 48.696 steps/iter
    n*2**29+m * : 2008935493 steps. 95710352 tests. 20.990 steps/iter
    n*2**30+m : 4493259608 steps. 89757100 tests. 50.060 steps/iter
    n*2**30+m * : 1915241920 steps. 89757100 tests. 21.338 steps/iter
    n*2**31+m : 4212124172 steps. 81419608 tests. 51.734 steps/iter
    n*2**31+m * : 1780581916 steps. 81419608 tests. 21.869 steps/iter
    'Advanced' mode marked with '*'

    As you can see, the 2**10 filter reduces the number of processed
    points by a factor of 8.5, but the number of processing steps is
    reduced only 1.9x in simple mode and 3.2x in advanced mode.
    For the 2**20 filter the ratios are, respectively, 23.7, 3.2 and 6.4.
    For the 2**31 filter: 52.7, 5.3 and 12.6.



    There was a bug in my testing code. In reality the filtering is
    somewhat more efficient. Here are the corrected results:

    Initial value = 4503599627370496 (0010000000000000).
    #values = 4294967296 (0000000100000000)
    Unfiltered : 22501638882 steps. 4294967296 tests. 5.239 steps/iter
    n*4+3 : 17132929762 steps. 1073741824 tests. 15.956 steps/iter
    n*2**5+m : 13374833378 steps. 536870912 tests. 24.913 steps/iter
    n*2**5+m * : 8408777442 steps. 536870912 tests. 15.663 steps/iter
    n*2**10+m : 9935504098 steps. 268435456 tests. 37.013 steps/iter
    n*2**10+m * : 5187551970 steps. 268435456 tests. 19.325 steps/iter
    n*2**15+m : 7857750754 steps. 169738240 tests. 46.293 steps/iter
    n*2**15+m * : 3449537250 steps. 169738240 tests. 20.323 steps/iter
    n*2**20+m : 6242468578 steps. 111935488 tests. 55.768 steps/iter
    n*2**20+m * : 2410906338 steps. 111935488 tests. 21.538 steps/iter
    n*2**25+m : 4838650338 steps. 73364736 tests. 65.953 steps/iter
    n*2**25+m * : 1716934242 steps. 73364736 tests. 23.403 steps/iter
    n*2**29+m : 3856366938 steps. 51085096 tests. 75.489 steps/iter
    n*2**29+m * : 1338119106 steps. 51085096 tests. 26.194 steps/iter
    n*2**30+m : 3856366938 steps. 51085096 tests. 75.489 steps/iter
    n*2**30+m * : 1261491462 steps. 51085096 tests. 24.694 steps/iter
    n*2**31+m : 3666319938 steps. 47284156 tests. 77.538 steps/iter
    n*2**31+m * : 1184863818 steps. 47284156 tests. 25.058 steps/iter

    Initial value = 9007199254740992 (0020000000000000).
    #values = 4294967296 (0000000100000000)
    Unfiltered : 22501540484 steps. 4294967296 tests. 5.239 steps/iter
    n*4+3 : 17132831364 steps. 1073741824 tests. 15.956 steps/iter
    n*2**5+m : 13374734980 steps. 536870912 tests. 24.912 steps/iter
    n*2**5+m * : 8408679044 steps. 536870912 tests. 15.662 steps/iter
    n*2**10+m : 9935405700 steps. 268435456 tests. 37.012 steps/iter
    n*2**10+m * : 5187453572 steps. 268435456 tests. 19.325 steps/iter
    n*2**15+m : 7857652356 steps. 169738240 tests. 46.293 steps/iter
    n*2**15+m * : 3449438852 steps. 169738240 tests. 20.322 steps/iter
    n*2**20+m : 6242370180 steps. 111935488 tests. 55.768 steps/iter
    n*2**20+m * : 2410807940 steps. 111935488 tests. 21.537 steps/iter
    n*2**25+m : 4838551940 steps. 73364736 tests. 65.952 steps/iter
    n*2**25+m * : 1716835844 steps. 73364736 tests. 23.401 steps/iter
    n*2**29+m : 3856268540 steps. 51085096 tests. 75.487 steps/iter
    n*2**29+m * : 1338020708 steps. 51085096 tests. 26.192 steps/iter
    n*2**30+m : 3856268540 steps. 51085096 tests. 75.487 steps/iter
    n*2**30+m * : 1261393064 steps. 51085096 tests. 24.692 steps/iter
    n*2**31+m : 3666221540 steps. 47284156 tests. 77.536 steps/iter
    n*2**31+m * : 1184765420 steps. 47284156 tests. 25.056 steps/iter

    The 2**10 filter reduces the number of processed points by a factor
    of 16.0, but the number of processing steps is reduced only 2.3x in
    simple mode and 4.3x in advanced mode.
    For the 2**20 filter the ratios are, respectively, 38.4, 3.6 and 9.3.
    For the 2**31 filter: 90.8, 6.1 and 19.0.
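    These factors can be re-derived directly from the corrected table;
    a quick check in plain Python (numbers copied from the first data
    set above, initial value 2**52):

    ```python
    # Reduction factors for the corrected run starting at 2**52.
    # Per filter: (tests, steps in simple mode, steps in advanced mode).
    unfiltered_steps = 22501638882
    total_points = 4294967296            # 2**32 starting values

    table = {
        10: (268435456, 9935504098, 5187551970),
        20: (111935488, 6242468578, 2410906338),
        31: (47284156, 3666319938, 1184863818),
    }

    for k, (tests, simple, adv) in table.items():
        print(f"2**{k}: points /{total_points / tests:.1f}, "
              f"simple steps /{unfiltered_steps / simple:.1f}, "
              f"advanced steps /{unfiltered_steps / adv:.1f}")
    ```

    The 2**20 row reproduces the 38.4 / 3.6 / 9.3 figures quoted above.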


    Note that for an FPGA implementation, even filtering with modulo
    2**29 (~12M entries) is impractical. Assuming advanced mode, 12M
    entries occupy ~1000 Mbits. In an inexpensive FPGA that simply does
    not fit; in a very expensive FPGA it fits, but barely.

    Let's take, for example, the Altera Stratix-10 GX 2800. By the
    standards of the previous decade it is a real behemoth, with 2.75M
    Logic Elements (LEs). Assuming that each tester unit occupies 500
    LEs, there is room for 5500 units. Looking at my tables above, you
    want 1 filter per ~20 tester units, so 275 filters are needed. And
    how many are available? Just two!


    And here I made an even bigger mistake. The size of the internal
    memory of the GX 2800 is 229 Mbit, so the 2**29 filter does not fit
    at all.

    For modulo 2**20 it's a little better. Each filter consists of
    ~45,000 entries and occupies ~2.7 Mbits, so you can fit 80+ filters
    in the GX 2800. But you want 5500 / 19 = 290, so it is still
    unbalanced.


    Here the correct numbers are as follows:
    Each filter consists of 27,328 entries and occupies ~1.7 Mbits, so
    you can fit 130+ filters in the GX 2800. But you want 5500 / 19 =
    290, so it is still unbalanced.

    For modulo 2**15 (2205 entries, ~100 Kbits) there is enough memory
    to fit the desired number of replicas. But a small filter like that
    is significantly less effective, reducing the number of steps (in
    advanced mode) only by 4.7x.


    Here it should be:
    For modulo 2**15 (1295 entries, ~60 Kbits) there is enough memory
    to fit the desired number of replicas. But a small filter like that
    is significantly less effective, reducing the number of steps (in
    advanced mode) only by 6.5x.

    It is possible that there exist smarter algorithms for partitioning
    the work between parallel tester units that could greatly reduce the
    need for access to the filtering tables relative to the simple
    scheme I assumed in the estimates above. But I think the point
    stands - this sort of filtering is not a silver bullet. It can
    improve throughput, but it is unlikely to improve it by more than
    one order of magnitude.


    Despite the corrected mistakes, my conclusions are the same.


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Feb 18 18:22:51 2026
    From Newsgroup: comp.arch

    On Sun, 15 Feb 2026 10:56:51 -0500
    Robert Finch <robfi680@gmail.com> wrote:

    I have coded a version that tests up to 8192 numbers simultaneously.
    It is supposed to be able to run at over 200 MHz. Some of the
    numbers would take a large number of iterations though.

    I found something strange: the code to account for more numbers does
    not increase the size of the core very much. (But it takes longer to
    build.) I am not sure if this occurs because there is a mistake
    somewhere, or if the tools are able to figure out how to share all
    the logic. I am thinking the logic for adders is shared somehow.
    128-bit arithmetic is being used.


    There is a mistake in your code. No other explanation is possible.

    You should look for 3 sorts of mistakes.
    1. Core itself.
    The most likely mistake is failing to look for one of the ending
    conditions. You have to look for 3 conditions:
    a) - the value produced at the current step is smaller than the
    initial value. That is the successful ending. Here you signal to
    the scheduler that you are ready to accept the next value.
    b) - overflow. The next value exceeds the width of the register.
    c) - timeout. You were running for more than N steps. N=8192 looks
    like a reasonable value.
    b) and c) are suspected failures.
    The handling for b) and c) is similar - you raise an alert signal
    and stop. Your test unit has to provide the supervisor circuitry
    with the means to fetch the initial value that caused the suspected
    failure and to clear the alert condition.
    After the alert is cleared, your unit returns to the same state as
    after success - waiting for the next initial value.
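    As a software sanity model of a single tester unit (illustrative
    Python; WIDTH and MAX_STEPS are my assumptions, not Robert's actual
    parameters), the loop with the three ending conditions would look
    like:

    ```python
    WIDTH = 128          # assumed register width of the tester unit
    MAX_STEPS = 8192     # assumed timeout N

    def test_one(n0):
        """Run one test; return ('ok' | 'overflow' | 'timeout', steps)."""
        limit = 1 << WIDTH
        n, steps = n0, 0
        while True:
            nxt = 3 * n + 1 if n & 1 else n >> 1
            steps += 1
            if nxt < n0:              # a) dropped below start: success
                return ('ok', steps)
            if nxt >= limit:          # b) overflow of the WIDTH-bit register
                return ('overflow', steps)
            if steps >= MAX_STEPS:    # c) timeout after N steps
                return ('timeout', steps)
            n = nxt
    ```

    Conditions b) and c) would raise the alert to the supervisor;
    condition a) signals the scheduler for the next value.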

    2. Scheduler.
    In a proof-of-concept design the scheduler does not have to be very
    realistic. For example, it does not have to be efficiently
    pipelined. But it should be able to monitor the ready signals of
    all test units and to supply different initial data to different
    units.
    Supplying the same data to all units simultaneously is the sort of
    mistake that would explain your observed behavior.

    3. Supervisor
    Again, in a proof-of-concept design the supervisor does not have to
    be very realistic. But it should be able to selectively exercise
    all core features related to alerts, including fetching of the
    initial data and clearing of the alert.

    After you have coded everything correctly, do not even try to
    compile a test with 8000 units. That is totally unrealistic. Even
    if you find a huge FPGA in which that number of testers could fit
    (and I am not sure that such an FPGA exists), the compilation will
    take many hours, possibly days.



    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed Feb 18 21:56:42 2026
    From Newsgroup: comp.arch

    Stephen Fuld wrote:
    On 2/14/2026 1:57 AM, Thomas Koenig wrote:
    Some people are throwing large amounts of computing power at trying
    to disprove the Collatz conjecture. Graphics cards have sped this
    up enormously.

    big snip


    The number of iterations would be tracked; if larger than a previous
    record, the calculation would terminate with a corresponding signal
    to the outside world.

    Also, each cycle, the result from the previous round would be
    compared against the input value. If it is lower, the next number
    would be chosen. Finally, the number reached would be checked
    against the previous, pre-set maximum, which would also signal to
    the outside world.

    Hold these thoughts

    snip

    The problem is embarrassingly parallel.


    Sort of. You could do it that way, but you would be giving up some
    performance. Specifically, as you say, you want to check the number
    against the previous input value and the preset maximum. But in a
    parallel implementation, you want to check against the largest
    value checked *across all the parallel threads*. In order to do
    that, the threads need to communicate, thus violating the
    "embarrassingly parallel" rule. I don't know how much performance
    you would give up, so it may not make much difference, nor do I
    know the cost of keeping some "global" largest value checked so
    far. There is a tradeoff here.
    KISS.
    In this case that means "don't care". It is perfectly OK to only
    update the "global max" each time a thread has finished a large
    block of candidates, so just checking against the starting value is
    fine.
    Assuming we also want to track the largest intermediate value seen,
    that could be updated the same way.
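    The policy is easy to model in software (illustrative Python; the
    names are mine, and in hardware the lock would correspond to the
    supervisor reading each unit's record register once per finished
    block):

    ```python
    from threading import Lock

    global_max = 0          # shared record, touched once per block
    global_lock = Lock()

    def collatz_peak(n):
        """Largest value reached on the trajectory of n down to 1."""
        peak = n
        while n != 1:
            n = 3 * n + 1 if n & 1 else n >> 1
            peak = max(peak, n)
        return peak

    def scan_block(start, count):
        """Scan a block of candidates with a purely local record, then
        merge into the global record exactly once."""
        global global_max
        local_max = 0
        for n in range(start, start + count):
            local_max = max(local_max, collatz_peak(n))
        with global_lock:              # one update per block, not per number
            global_max = max(global_max, local_max)
        return local_max
    ```

    Each worker contends for the shared record once per block, so the
    synchronization cost is amortized over the whole block.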
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Feb 18 22:56:19 2026
    From Newsgroup: comp.arch

    On 2026-02-18 11:22 a.m., Michael S wrote:
    On Sun, 15 Feb 2026 10:56:51 -0500
    Robert Finch <robfi680@gmail.com> wrote:

    I have coded a version that tests up to 8192 numbers simultaneously.
    It is supposed to be able to run at over 200 MHz. Some of the
    numbers would take a large number of iterations though.

    I found something strange: the code to account for more numbers does
    not increase the size of the core very much. (But it takes longer to
    build.) I am not sure if this occurs because there is a mistake
    somewhere, or if the tools are able to figure out how to share all
    the logic. I am thinking the logic for adders is shared somehow.
    128-bit arithmetic is being used.


    There is a mistake in your code. No other explanation is possible.

    You should look for 3 sorts of mistakes.
    1. Core itself.
    The most likely mistake is failing to look for one of the ending
    conditions. You have to look for 3 conditions:
    a) - the value produced at the current step is smaller than the
    initial value. That is the successful ending. Here you signal to
    the scheduler that you are ready to accept the next value.
    b) - overflow. The next value exceeds the width of the register.
    c) - timeout. You were running for more than N steps. N=8192 looks
    like a reasonable value.
    b) and c) are suspected failures.
    The handling for b) and c) is similar - you raise an alert signal
    and stop. Your test unit has to provide the supervisor circuitry
    with the means to fetch the initial value that caused the suspected
    failure and to clear the alert condition.
    After the alert is cleared, your unit returns to the same state as
    after success - waiting for the next initial value.

    I found the mistake. The core was working, but groups of cores were
    not; done signals were missing.


    2. Scheduler.
    In a proof-of-concept design the scheduler does not have to be very
    realistic. For example, it does not have to be efficiently
    pipelined. But it should be able to monitor the ready signals of
    all test units and to supply different initial data to different
    units.
    Supplying the same data to all units simultaneously is the sort of
    mistake that would explain your observed behavior.

    There is really simple scheduling ATM: just a bunch of done signals
    that are gated together, meaning the slowest calculation of the
    group will slow everything else down. Once all the dones are true,
    the next set of numbers is loaded.

    3. Supervisor
    Again, in a proof-of-concept design the supervisor does not have to
    be very realistic. But it should be able to selectively exercise
    all core features related to alerts, including fetching of the
    initial data and clearing of the alert.

    After you have coded everything correctly, do not even try to
    compile a test with 8000 units. That is totally unrealistic. Even
    if you find a huge FPGA in which that number of testers could fit
    (and I am not sure that such an FPGA exists), the compilation will
    take many hours, possibly days.



    Yeah, 8,000 was definitely out of the picture. I have since coded a
    24x24 matrix (576 units), and although the logic would fit, it
    turned out not to route. So it is down to 20x20; 20x20 with no real
    scheduling fits.

    I tried displaying progress as an on-screen bitmap, but that just
    appeared noisy. I have been working on an SCP, so I may be able to
    use that to manage the scheduling with some custom logic.



    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Feb 19 12:10:41 2026
    From Newsgroup: comp.arch

    On Wed, 18 Feb 2026 22:56:19 -0500
    Robert Finch <robfi680@gmail.com> wrote:

    On 2026-02-18 11:22 a.m., Michael S wrote:
    On Sun, 15 Feb 2026 10:56:51 -0500
    Robert Finch <robfi680@gmail.com> wrote:

    I have coded a version that tests up to 8192 numbers simultaneously.
    It is supposed to be able to run at over 200 MHz. Some of the
    numbers would take a large number of iterations though.

    I found something strange: the code to account for more numbers does
    not increase the size of the core very much. (But it takes longer to
    build.) I am not sure if this occurs because there is a mistake
    somewhere, or if the tools are able to figure out how to share all
    the logic. I am thinking the logic for adders is shared somehow.
    128-bit arithmetic is being used.


    There is a mistake in your code. No other explanation is possible.

    You should look for 3 sorts of mistakes.
    1. Core itself.
    The most likely mistake is failing to look for one of the ending
    conditions. You have to look for 3 conditions:
    a) - the value produced at the current step is smaller than the
    initial value. That is the successful ending. Here you signal to
    the scheduler that you are ready to accept the next value.
    b) - overflow. The next value exceeds the width of the register.
    c) - timeout. You were running for more than N steps. N=8192 looks
    like a reasonable value.
    b) and c) are suspected failures.
    The handling for b) and c) is similar - you raise an alert signal
    and stop. Your test unit has to provide the supervisor circuitry
    with the means to fetch the initial value that caused the suspected
    failure and to clear the alert condition.
    After the alert is cleared, your unit returns to the same state as
    after success - waiting for the next initial value.

    I found the mistake. The core was working, but groups of cores were
    not; done signals were missing.


    2. Scheduler.
    In a proof-of-concept design the scheduler does not have to be very
    realistic. For example, it does not have to be efficiently
    pipelined. But it should be able to monitor the ready signals of
    all test units and to supply different initial data to different
    units.
    Supplying the same data to all units simultaneously is the sort of
    mistake that would explain your observed behavior.

    There is really simple scheduling ATM: just a bunch of done signals
    that are gated together, meaning the slowest calculation of the
    group will slow everything else down. Once all the dones are true,
    the next set of numbers is loaded.

    3. Supervisor
    Again, in a proof-of-concept design the supervisor does not have to
    be very realistic. But it should be able to selectively exercise
    all core features related to alerts, including fetching of the
    initial data and clearing of the alert.

    After you have coded everything correctly, do not even try to
    compile a test with 8000 units. That is totally unrealistic. Even
    if you find a huge FPGA in which that number of testers could fit
    (and I am not sure that such an FPGA exists), the compilation will
    take many hours, possibly days.



    Yeah, 8,000 was definitely out of the picture. I have since coded a
    24x24 matrix (576 units), and although the logic would fit, it
    turned out not to route. So it is down to 20x20; 20x20 with no real
    scheduling fits.

    I tried displaying progress as an on-screen bitmap, but that just
    appeared noisy. I have been working on an SCP, so I may be able to
    use that to manage the scheduling with some custom logic.




    Does 20x20 mean 400 cores with something like 80-bit initial values
    and 128-bit processing?
    If you didn't make further mistakes then it should be rather big,
    on the order of 300K LEs. It would not fit in the biggest low-end
    Altera device (Cyclone 10 GX 10CX220).
    On the Xilinx side, it would not fit in any Spartan or in most
    Artix devices. Maybe it would fit in the biggest Artix UltraScale+
    (AU25P). But more realistically, for 400 test units one would want
    Arria 10 on the Altera side or Kintex on the Xilinx side. Neither
    is prohibitively expensive, but they are not cheap.

    What device are you using?
    If it is smaller than a Kintex KU5P then I'd strongly suspect that
    you didn't clean out all the mistakes.



    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Thu Feb 19 11:51:03 2026
    From Newsgroup: comp.arch

    Robert Finch wrote:
    On 2026-02-18 11:22 a.m., Michael S wrote:
    On Sun, 15 Feb 2026 10:56:51 -0500
    Robert Finch <robfi680@gmail.com> wrote:

    I have coded a version that tests up to 8192 numbers simultaneously.
    It is supposed to be able to run at over 200 MHz. Some of the
    numbers would take a large number of iterations though.

    I found something strange: the code to account for more numbers does
    not increase the size of the core very much. (But it takes longer to
    build.) I am not sure if this occurs because there is a mistake
    somewhere, or if the tools are able to figure out how to share all
    the logic. I am thinking the logic for adders is shared somehow.
    128-bit arithmetic is being used.


    There is a mistake in your code. No other explanation is possible.

    You should look for 3 sorts of mistakes.
    1. Core itself.
    The most likely mistake is failing to look for one of the ending
    conditions. You have to look for 3 conditions:
    a) - the value produced at the current step is smaller than the
    initial value. That is the successful ending. Here you signal to
    the scheduler that you are ready to accept the next value.
    b) - overflow. The next value exceeds the width of the register.
    c) - timeout. You were running for more than N steps. N=8192 looks
    like a reasonable value.
    b) and c) are suspected failures.
    The handling for b) and c) is similar - you raise an alert signal
    and stop. Your test unit has to provide the supervisor circuitry
    with the means to fetch the initial value that caused the suspected
    failure and to clear the alert condition.
    After the alert is cleared, your unit returns to the same state as
    after success - waiting for the next initial value.

    I found the mistake. The core was working, but groups of cores were
    not; done signals were missing.


    2. Scheduler.
    In a proof-of-concept design the scheduler does not have to be very
    realistic. For example, it does not have to be efficiently
    pipelined. But it should be able to monitor the ready signals of
    all test units and to supply different initial data to different
    units.
    Supplying the same data to all units simultaneously is the sort of
    mistake that would explain your observed behavior.

    There is really simple scheduling ATM: just a bunch of done signals
    that are gated together, meaning the slowest calculation of the
    group will slow everything else down. Once all the dones are true,
    the next set of numbers is loaded.
    The obvious choice is to have a work-to-do list, each work item
    containing a start-end range of numbers to test.
    Each worker thread would simply finish its current task, store
    the/any results, and then ask for the next available task.
    A single-producer, multiple-consumer queue like this is very simple
    to get right with minimal admin: just a single LOCK XADD to update
    the queue back index, grabbing the corresponding task.
    I used this exact setup for my wire-speed multi-threaded ntpd
    daemon about two decades ago. (I did need a few more XADD
    instructions to handle queue full/empty situations, which are not
    applicable for the current problem.)
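    In software the scheme is only a few lines (illustrative Python of
    my own; the Lock stands in for the LOCK XADD atomic, and all names
    are mine):

    ```python
    from threading import Lock

    class TaskQueue:
        """Single-producer, multiple-consumer work list of (start, end)
        ranges; consumers grab tasks via an atomic fetch-and-increment."""

        def __init__(self, tasks):
            self.tasks = list(tasks)   # filled once by the producer
            self.back = 0
            self._lock = Lock()        # models the LOCK XADD atomic

        def grab(self):
            """Fetch-and-increment the back index; return a task or None."""
            with self._lock:
                i = self.back
                self.back += 1
            return self.tasks[i] if i < len(self.tasks) else None
    ```

    Each worker simply loops on grab() until it returns None; the
    full/empty handling mentioned above is deliberately omitted here.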
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Thu Feb 19 15:27:07 2026
    From Newsgroup: comp.arch

    I must say I find this whole thread baffling: when searching for a
    counterexample and we're already in the order of 2^70, it seems
    clear that constant-factor improvements are completely irrelevant
    (even with large constants).

    I guess it all boils down to the "useless" part of the title?


    === Stefan
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Feb 19 23:12:41 2026
    From Newsgroup: comp.arch

    On Thu, 19 Feb 2026 11:51:03 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:
    Robert Finch wrote:
    On 2026-02-18 11:22 a.m., Michael S wrote:
    On Sun, 15 Feb 2026 10:56:51 -0500
    Robert Finch <robfi680@gmail.com> wrote:

    I have coded a version that tests up to 8192 numbers simultaneously.
    It is supposed to be able to run at over 200 MHz. Some of the
    numbers would take a large number of iterations though.

    I found something strange: the code to account for more numbers does
    not increase the size of the core very much. (But it takes longer to
    build.) I am not sure if this occurs because there is a mistake
    somewhere, or if the tools are able to figure out how to share all
    the logic. I am thinking the logic for adders is shared somehow.
    128-bit arithmetic is being used.


    There is a mistake in your code. No other explanation is possible.

    You should look for 3 sorts of mistakes.
    1. Core itself.
    The most likely mistake is failing to look for one of the ending
    conditions. You have to look for 3 conditions:
    a) - the value produced at the current step is smaller than the
    initial value. That is the successful ending. Here you signal to
    the scheduler that you are ready to accept the next value.
    b) - overflow. The next value exceeds the width of the register.
    c) - timeout. You were running for more than N steps. N=8192 looks
    like a reasonable value.
    b) and c) are suspected failures.
    The handling for b) and c) is similar - you raise an alert signal
    and stop. Your test unit has to provide the supervisor circuitry
    with the means to fetch the initial value that caused the suspected
    failure and to clear the alert condition.
    After the alert is cleared, your unit returns to the same state as
    after success - waiting for the next initial value.

    I found the mistake. The core was working, but groups of cores were
    not; done signals were missing.


    2. Scheduler.
    In a proof-of-concept design the scheduler does not have to be very
    realistic. For example, it does not have to be efficiently
    pipelined. But it should be able to monitor the ready signals of
    all test units and to supply different initial data to different
    units.
    Supplying the same data to all units simultaneously is the sort of
    mistake that would explain your observed behavior.

    There is really simple scheduling ATM: just a bunch of done signals
    that are gated together, meaning the slowest calculation of the
    group will slow everything else down. Once all the dones are true,
    the next set of numbers is loaded.

    The obvious choice is to have a work-to-do list, each work item
    containing a start-end set of numbers to test.

    Each worker thread would simply finish its current task, store
    the/any results, and then ask for the next available task.

    A single producer - multiple consumers queue like this is very simple
    to get right with minimal admin: just a single LOCK XADD to update
    the queue back index, grabbing the corresponding task.

    There is no XADD in FPGA, just gates, multipliers and memory blocks.
    I used this exact setup for my wire speed multi-threaded ntpd daemon
    about two decades ago. (I did need a few more XADD instructions to
    handle queue full/empty situations which are not applicable for the
    current problem.)

    Terje

    Did your ntpd daemon scale to 300-400 threads?
    300 is the # of test units that fits in the FPGAs that appear to be
    the most cost-effective for this sort of job, e.g. 10CX220YU484E5G.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Feb 20 08:01:47 2026
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> schrieb:

    You should look for 3 sorts of mistakes
    1. Core itself.
    The most likely mistake is failing to look for one of the ending
    conditions. You have to look for 3 conditions:
    a) - value produced at current step is smaller than initial value.
    That is successful ending. Here you signal to the scheduler that you
    are ready to accept the next value.
    b) - overflow. Next value exceeds the width of register.
    c) - timeout. You were running for more than N steps. N=8192 looks
    like a reasonable value.

    Looking at the links in https://oeis.org/A006878 , anything above
    2442 would be a new record, so 8192 would be plenty.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
  • From Michael S@already5chosen@yahoo.com to comp.arch on Fri Feb 20 11:41:53 2026
    From Newsgroup: comp.arch

    On Thu, 19 Feb 2026 15:27:07 -0500
    Stefan Monnier <monnier@iro.umontreal.ca> wrote:

    I must say I find this whole thread baffling: when searching for
    a counterexample and we're already in the order of 2^70, it seems
    clear that constant-factor improvements are completely irrelevant
    (even with large constants).

    I guess it all boils down to the "useless" part of the title?


    === Stefan


    As I wrote in an earlier post, "useless" sounds too optimistic.
    "Useless" in my book is proving or disproving 99% of the conjectures
    known in number theory, both obscure and extremely famous (Fermat's
    Last Theorem). And then 1% of them end up very useful.

    But Collatz's conjecture is believed to be true. Trying to disprove it
    by experiment is futile.
    Finding that all sequences with initial value in the range
    [2**71:2**72] converge to 1, which is by far the most likely outcome
    of the experiment, leaves us in exactly the same place where we
    started, i.e. the conjecture is believed to be true, but unproven.
    That's not good enough to warrant the title of "useless". "Pointless"
    sounds more proper, but maybe we shall find something more derogatory.

    However, it's fun.
    My current estimate, after considering how "smart" filtering by modulo
    2**20 can be made practical, is that a single FPGA that costs 300 USD
    would be able to test 2**71 values in ~500 years. Which means that the
    company I am working for can be convinced to build gear capable of
    doing calculations in 1 year for something like 1M USD or a little more.
    I'd guess that smaller and hungrier engineering firms would accept the
    job for 1/3rd of that sum.
    So, who has pocket money to spend?


  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Fri Feb 20 15:11:26 2026
    From Newsgroup: comp.arch

    Michael S wrote:
    On Thu, 19 Feb 2026 11:51:03 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Robert Finch wrote:
    On 2026-02-18 11:22 a.m., Michael S wrote:
    On Sun, 15 Feb 2026 10:56:51 -0500
    Robert Finch <robfi680@gmail.com> wrote:

    I have coded a version that tests up to 8192 numbers
    simultaneously. It is supposed to be able to run at over 200 MHz.
    Some of the numbers would take a large number of iterations though.

    I found something strange: the code to account for more numbers does
    not increase the size of the core very much. (But it takes longer
    to build.) I am not sure if this occurs because there is a mistake
    somewhere, or if the tools are able to figure out how to share all
    the logic. I am thinking the logic for adders is shared somehow.
    128-bit arithmetic is being used.


    There is a mistake in your code. No other explanation is possible.

    You should look for 3 sorts of mistakes:
    1. Core itself.
    The most likely mistake is failing to look for one of the ending
    conditions. You have to look for 3 conditions:
    a) - value produced at the current step is smaller than the initial
    value. That is a successful ending. Here you signal to the scheduler
    that you are ready to accept the next value.
    b) - overflow. Next value exceeds the width of the register.
    c) - timeout. You were running for more than N steps. N=8192 looks
    like a reasonable value.
    b) and c) are suspected failures.
    The handling for b) and c) is similar - you raise an alert signal and
    stop. Your test unit has to provide the supervisor circuitry with the
    means to fetch the initial value that caused the suspected failure and
    to clear the alert condition.
    After the alert is cleared your unit returns to the same state as after
    success - waiting for the next initial value.

    I found the mistake. Core was working, but groups of cores were
    not. Done signals were missing.


    2. Scheduler.
    In proof-of-concept design the scheduler does not have to be very
    realistic. For example, it does not have to be efficiently
    pipelined. But it should be able to monitor ready signals of all
    test units and to supply different initial data to different units.
    Supplying the same data to all units simultaneously is a sort of
    mistake that explains your observed behavior.

    There is really simple scheduling ATM. There is just a bunch of
    done signals that are gated together. Meaning the slowest
    calculation of the group will slow everything else down. Once all
    the dones are true the next set of numbers are loaded.

    The obvious choice is to have a work-to-do list, each work item
    containing a start-end set of numbers to test.

    Each worker thread would simply finish its current task, store
    the/any results, and then ask for the next available task.

    A single producer - multiple consumers queue like this is very simple
    to get right with minimum admins: Just a single LOCK XADD to update
    the queue back index, grabbing the corresponding task.


    There is no XADD in FPGA, just gate, multipliers and memory blocks.
    Oops! :-)

    I used this exact setup for my wire speed multi-threaded ntpd deamon
    about two decades ago. (I did need a few more XADD instructions to
    handle queue full/empty situations which are not applicable for the
    current problem.)


    Did your ntpd deamon scale to 300-400 threads?
    300 is the # of test units that fits in FPGAs that appears to be the
    most cost-effective for this sort of job, e.g. 10CX220YU484E5G.
    Scaling does not depend on the number of threads, only on the frequency
    of pulls from the queue: My code was very latency sensitive so I could
    never batch up multiple requests, while this particular
    "Gedankenexperiment" can very easily use arbitrarily large work lists,
    right?
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Feb 21 01:56:03 2026
    From Newsgroup: comp.arch

    On 2026-02-19 5:10 a.m., Michael S wrote:
    On Wed, 18 Feb 2026 22:56:19 -0500
    Robert Finch <robfi680@gmail.com> wrote:

    On 2026-02-18 11:22 a.m., Michael S wrote:
    On Sun, 15 Feb 2026 10:56:51 -0500
    Robert Finch <robfi680@gmail.com> wrote:

    I have coded a version that tests up to 8192 numbers simultaneously.
    It is supposed to be able to run at over 200 MHz. Some of the
    numbers would take a large number of iterations though.

    I found something strange: the code to account for more numbers does
    not increase the size of the core very much. (But it takes longer
    to build.) I am not sure if this occurs because there is a mistake
    somewhere, or if the tools are able to figure out how to share all
    the logic. I am thinking the logic for adders is shared somehow.
    128-bit arithmetic is being used.


    There is a mistake in your code. No other explanation is possible.

    You should look for 3 sorts of mistakes:
    1. Core itself.
    The most likely mistake is failing to look for one of the ending
    conditions. You have to look for 3 conditions:
    a) - value produced at the current step is smaller than the initial
    value. That is a successful ending. Here you signal to the scheduler
    that you are ready to accept the next value.
    b) - overflow. Next value exceeds the width of the register.
    c) - timeout. You were running for more than N steps. N=8192 looks
    like a reasonable value.
    b) and c) are suspected failures.
    The handling for b) and c) is similar - you raise an alert signal and
    stop. Your test unit has to provide the supervisor circuitry with the
    means to fetch the initial value that caused the suspected failure and
    to clear the alert condition.
    After the alert is cleared your unit returns to the same state as after
    success - waiting for the next initial value.

    I found the mistake. Core was working, but groups of cores were not.
    Done signals were missing.


    2. Scheduler.
    In proof-of-concept design the scheduler does not have to be very
    realistic. For example, it does not have to be efficiently
    pipelined. But it should be able to monitor ready signals of all
    test units and to supply different initial data to different units.
    Supplying the same data to all units simultaneously is a sort of
    mistake that explains your observed behavior.

    There is really simple scheduling ATM. There is just a bunch of done
    signals that are gated together. Meaning the slowest calculation of
    the group will slow everything else down. Once all the dones are true
    the next set of numbers are loaded.

    3. Supervisor
    Again, in proof-of-concept design the supervisor does not have to be
    very realistic. But it should be able to selectively exercise all
    core feature related to alert, including fetching of initial data
    and clearing of alert.

    After you coded everything correctly, do not even try to compile
    test with 8000 units. That is totally unrealistic. Even if you find
    huge FPGA in which such number of testers could fit (and I am not
    sure that such FPGA exists) the compilation will take many hours,
    possibly days.



    Yeah, 8,000 was definitely out of the picture. I had since coded a
    24x24 matrix (576) and although the logic would fit, it turned out
    not to be able to route. So, it is down to 20x20. 20x20 with no real
    scheduling fits.

    I tried displaying progress as an on-screen bitmap, but that just
    appeared noisy. I have been working on an SCP so may be able to use
    that to manage the scheduling with some custom logic.




    Does 20x20 mean 400 cores with something like 80-bit initial values
    and 128-bit processing?
    Yes. 128-bit processing.

    If you didn't make further mistakes then it should be rather big, order
    of 300K LEs. It would not fit in the biggest low-end Altera device
    (Cyclone 10GX 10CX220).

    IIRC it is about 150k LUTs. It uses the simplest approach, which is
    not very fast iteration-wise, but it allows a lot of cores and a fast
    clock cycle time. One core is very small and has a fast iteration (1
    clock cycle). But it takes a lot more iterations than can be done
    with a more brainiac approach.

    Using a more brainiac approach would likely cut the performance in half
    and use a lot more LUTs.

    On the Xilinx side, it would not fit in any Spartan and in most Artix
    devices. Maybe it would fit in the biggest Artix UltraScale+ (AU25P).
    But more realistically for 400 test units one would want Arria 10 on
    Altera side or Kintex on Xilinx side. Both are not prohibitively
    expensive, but not cheap.

    What device are you using?
    I am not using a particularly inexpensive device. I have a Kintex-325
    with about 200k LUTs.

    If it is smaller than Kintex KU5P then I'd strongly suspect that you
    didn't clean out all mistakes.



    There could be more mistakes for sure. But I am sure the lowest-level
    core works. It seemed to in simulation. I made a slight mod to it
    since I tested it last, checking n < startn.
    Counts are not used so the logic would be stripped out, but it's good
    for verification.

    1 Core is:

    module Collatz_conjecture(rst, clk, startn, count, done);
    input rst;
    input clk;
    input [127:0] startn;
    output reg [127:0] count;
    output reg done;

    localparam TRUE  = 1'b1;
    localparam FALSE = 1'b0;

    reg [127:0] n;

    always_ff @(posedge clk)
    if (rst) begin
      n <= startn;
      count <= 0;
      done <= FALSE;
    end
    else begin
      if (!done) begin
        // even: halve; odd: 3*n+1
        if (~n[0])
          n <= n >> 1;
        else
          n <= n + n + n + 1;
        // finished once the value drops below the starting value
        if (n < startn)
          done <= TRUE;
        count <= count + 1;
      end
    end

    endmodule


  • From Michael S@already5chosen@yahoo.com to comp.arch on Sat Feb 21 19:23:02 2026
    From Newsgroup: comp.arch

    On Sat, 21 Feb 2026 01:56:03 -0500
    Robert Finch <robfi680@gmail.com> wrote:

    On 2026-02-19 5:10 a.m., Michael S wrote:
    On Wed, 18 Feb 2026 22:56:19 -0500
    Robert Finch <robfi680@gmail.com> wrote:

    On 2026-02-18 11:22 a.m., Michael S wrote:
    On Sun, 15 Feb 2026 10:56:51 -0500
    Robert Finch <robfi680@gmail.com> wrote:

    I have coded a version that tests up to 8192 numbers
    simultaneously. It is supposed to be able to run at over 200
    MHz. Some of the numbers would take a large number of iterations
    though.

    I found something strange: the code to account for more numbers
    does not increase the size of the core very much. (But it takes
    longer to build.) I am not sure if this occurs because there is
    a mistake somewhere, or if the tools are able to figure out how
    to share all the logic. I am thinking the logic for adders is
    shared somehow. 128-bit arithmetic is being used.


    There is a mistake in your code. No other explanation is possible.

    You should look for 3 sorts of mistakes:
    1. Core itself.
    The most likely mistake is failing to look for one of the ending
    conditions. You have to look for 3 conditions:
    a) - value produced at the current step is smaller than the initial
    value. That is a successful ending. Here you signal to the scheduler
    that you are ready to accept the next value.
    b) - overflow. Next value exceeds the width of the register.
    c) - timeout. You were running for more than N steps. N=8192 looks
    like a reasonable value.
    b) and c) are suspected failures.
    The handling for b) and c) is similar - you raise an alert signal and
    stop. Your test unit has to provide the supervisor circuitry with the
    means to fetch the initial value that caused the suspected failure and
    to clear the alert condition.
    After the alert is cleared your unit returns to the same state as after
    success - waiting for the next initial value.

    I found the mistake. Core was working, but groups of cores were
    not. Done signals were missing.


    2. Scheduler.
    In proof-of-concept design the scheduler does not have to be very
    realistic. For example, it does not have to be efficiently
    pipelined. But it should be able to monitor ready signals of all
    test units and to supply different initial data to different
    units. Supplying the same data to all units simultaneously is a
    sort of mistake that explains your observed behavior.

    There is really simple scheduling ATM. There is just a bunch of
    done signals that are gated together. Meaning the slowest
    calculation of the group will slow everything else down. Once all
    the dones are true the next set of numbers are loaded.

    3. Supervisor
    Again, in proof-of-concept design the supervisor does not have to
    be very realistic. But it should be able to selectively exercise
    all core feature related to alert, including fetching of initial
    data and clearing of alert.

    After you coded everything correctly, do not even try to compile
    test with 8000 units. That is totally unrealistic. Even if you
    find huge FPGA in which such number of testers could fit (and I
    am not sure that such FPGA exists) the compilation will take many
    hours, possibly days.



    Yeah 8,000 was definitely out of the picture. I had since coded
    24x24 matrix (576) and although the logic would fit that turned
    out not to be able to route. So, it is down to 20x20. 20x20 with
    no real scheduling fits.

    I tried displaying progress as an on-screen bitmap, but that just
    appeared noisy. I have been working on an SCP so may be able to use
    that to manage the scheduling with some custom logic.




    Does 20x20 mean 400 cores with something like 80-bit initial values
    and 128-bit processing?
    Yes. 128-bit processing.

    If you didn't make further mistakes then it should be rather big,
    order of 300K LEs. It would not fit in the biggest low-end Altera
    device (Cyclone 10GX 10CX220).

    IIRC it is about 150k LUTs. It uses the simplest approach, which is
    not very fast iteration wise, but it allows a lot of cores and a fast
    clock cycle time. One core is very small and has a fast iteration (1
    clock cycle). But it takes a lot more iterations than can be done
    with a more brainiac approach.

    Using a more brainiac approach would likely cut the performance in
    half and use a lot more LUTs.

    On the Xilinx side, it would not fit in any Spartan and in most
    Artix devices. Maybe it would fit in the biggest Artix
    UltraScale+ (AU25P). But more realistically for 400 test units one
    would want Arria 10 on Altera side or Kintex on Xilinx side. Both
    are not prohibitively expensive, but not cheap.

    What device are you using?
    I am not using a particularly inexpensive device. I have a Kintex-325
    with about 200k LUTs.

    If it is smaller than Kintex KU5P then I'd strongly suspect that you
    didn't clean out all mistakes.



    There could be more mistakes for sure. But I am sure the lowest level
    core works. It seemed to in simulation. I made a slight mod to it
    since I tested it last, checking n < startn.
    Counts are not used so the logic would be stripped out, but it's good
    for verification.

    1 Core is:

    module Collatz_conjecture(rst, clk, startn, count, done);
    input rst;
    input clk;
    input [127:0] startn;
    output reg [127:0] count;
    output reg done;

    localparam TRUE  = 1'b1;
    localparam FALSE = 1'b0;

    reg [127:0] n;

    always_ff @(posedge clk)
    if (rst) begin
      n <= startn;
      count <= 0;
      done <= FALSE;
    end
    else begin
      if (!done) begin
        // even: halve; odd: 3*n+1
        if (~n[0])
          n <= n >> 1;
        else
          n <= n + n + n + 1;
        // finished once the value drops below the starting value
        if (n < startn)
          done <= TRUE;
        count <= count + 1;
      end
    end

    endmodule




  • From BGB@cr88192@gmail.com to comp.arch on Sat Feb 21 12:27:38 2026
    From Newsgroup: comp.arch

    On 2/14/2026 1:18 PM, David Brown wrote:
    On 14/02/2026 19:48, MitchAlsup wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Some people are throwing large amounts of computing power at trying
    to disprove the Collatz conjecture. Graphics cards have sped this
    up enormously.

    (A reminder, if any is needed: From a starting number n, the
    transformation, recursively applied,


    f(n) = 3*n+1, if n is odd; = n/2 if n is even.

    is conjectured to always reach 1 for any positive integer n).

    While n/2 remains even, n/2 is guaranteed to converge to 1.
    So, only the odd cases are interesting.

    3*n+1 is always even (since n was odd, and odd*odd is odd and
    odd+1 is even). So, a double iteration is (3*n+1)/2.

    Yes, that is often a micro-optimisation used.

    All the work is done on general-purpose hardware, which has many
    capabilities that are not needed, and consume area and power that
    special-purpose hardware would not need. Also, the hardware
    is limited to 64-bit integers, and the range of tested numbers
    is now up to 2^71, so

    Induction would say it has been proven by then.

    Would you like to re-think that? "Induction" says nothing of the sort.

    The conjecture has been proven true for all numbers up to 2^71
    (assuming that is the current figure) - it has most definitely /not/
    been proven to be true for all numbers. There are other things proven
    about it, such as the asymptotic rarity of numbers that are exceptions
    to the pattern, but no one has any notion of how to prove it in
    general.

    Being true up to 2^71 makes it reasonable to believe it is always
    true, but that is very different from being /proven/. There are
    plenty of mathematical conjectures that are true up to very large
    numbers, but fail later on.


    Comments?

    Interesting problem,
    waste of power.

    It is a mathematically very interesting problem. Whether there is
    anything to be gained by continuing the search for exceptions by
    brute force is a different matter - but /if/ it is worth doing, then
    doing so more efficiently is a good idea.

    FWIW:
    Is there any number likely to disprove it? No.
    Would running it out to some arbitrary precision prove it? No.

    Granted, finding a value where it failed would disprove it, but this is unlikely to happen.

    But, can note:
    For all positive even numbers, n/2 is less than n;
    For all positive odd numbers, n*3+1 is even;
    So, it becomes (n*3+1)/2
    For all positive even numbers:
    Probability of n/2 being an integer is 100%.
    Probability of n/2 being even is 50%.
    For (n*3+1)/2:
    Probability of this being even is 50%.
    So, 50% of the time the value gets bigger, 50% it gets smaller.
    Probability of a double odd: 25%
    Probability of a triple odd: 12.5%
    ...

    OK, so it would seem like the probability, for n, of finding a path
    that diverges becomes 1/2^k, where k=log2(n).

    Where, the paths that get smaller can't include a path that gets bigger
    unless the current path itself can get bigger, where the probability of
    this happening quickly approaches 0.

    If there were any place likely to find a divergence, it would be for
    small integers; this is empirically known to be false.


    Granted, saying that the probability approaches 0 is not the same as
    saying that the probability is 0.

    The probability of it being 0 seems directly proportional to the reciprocal:
    If p>(1/n): A divergence would be expected.
    If p<(1/n): A divergence would be impossible.
    But, p=1/n, as (1/n)=(1/(2**log2(n)))

    Though, since n*3+1 is always even, it is p=1/(2**ceil(log2(n))).

    1/(2**ceil(log2(n))) < (1/n) for all n that are not powers of 2 (it
    would only be equal for powers of 2, but powers of 2 always converge).

    So, it would seem like probability of divergence converges on 0.

    Dunno, though, this is just how it seems to me at the moment.


  • From Michael S@already5chosen@yahoo.com to comp.arch on Sat Feb 21 20:36:51 2026
    From Newsgroup: comp.arch

    On Sat, 21 Feb 2026 01:56:03 -0500
    Robert Finch <robfi680@gmail.com> wrote:

    On 2026-02-19 5:10 a.m., Michael S wrote:
    On Wed, 18 Feb 2026 22:56:19 -0500
    Robert Finch <robfi680@gmail.com> wrote:

    On 2026-02-18 11:22 a.m., Michael S wrote:
    On Sun, 15 Feb 2026 10:56:51 -0500
    Robert Finch <robfi680@gmail.com> wrote:

    I have coded a version that tests up to 8192 numbers
    simultaneously. It is supposed to be able to run at over 200
    MHz. Some of the numbers would take a large number of iterations
    though.

    I found something strange: the code to account for more numbers
    does not increase the size of the core very much. (But it takes
    longer to build.) I am not sure if this occurs because there is
    a mistake somewhere, or if the tools are able to figure out how
    to share all the logic. I am thinking the logic for adders is
    shared somehow. 128-bit arithmetic is being used.


    There is a mistake in your code. No other explanation is possible.

    You should look for 3 sorts of mistakes:
    1. Core itself.
    The most likely mistake is failing to look for one of the ending
    conditions. You have to look for 3 conditions:
    a) - value produced at the current step is smaller than the initial
    value. That is a successful ending. Here you signal to the scheduler
    that you are ready to accept the next value.
    b) - overflow. Next value exceeds the width of the register.
    c) - timeout. You were running for more than N steps. N=8192 looks
    like a reasonable value.
    b) and c) are suspected failures.
    The handling for b) and c) is similar - you raise an alert signal and
    stop. Your test unit has to provide the supervisor circuitry with the
    means to fetch the initial value that caused the suspected failure and
    to clear the alert condition.
    After the alert is cleared your unit returns to the same state as after
    success - waiting for the next initial value.

    I found the mistake. Core was working, but groups of cores were
    not. Done signals were missing.


    2. Scheduler.
    In proof-of-concept design the scheduler does not have to be very
    realistic. For example, it does not have to be efficiently
    pipelined. But it should be able to monitor ready signals of all
    test units and to supply different initial data to different
    units. Supplying the same data to all units simultaneously is a
    sort of mistake that explains your observed behavior.

    There is really simple scheduling ATM. There is just a bunch of
    done signals that are gated together. Meaning the slowest
    calculation of the group will slow everything else down. Once all
    the dones are true the next set of numbers are loaded.

    3. Supervisor
    Again, in proof-of-concept design the supervisor does not have to
    be very realistic. But it should be able to selectively exercise
    all core feature related to alert, including fetching of initial
    data and clearing of alert.

    After you coded everything correctly, do not even try to compile
    test with 8000 units. That is totally unrealistic. Even if you
    find huge FPGA in which such number of testers could fit (and I
    am not sure that such FPGA exists) the compilation will take many
    hours, possibly days.



    Yeah 8,000 was definitely out of the picture. I had since coded
    24x24 matrix (576) and although the logic would fit that turned
    out not to be able to route. So, it is down to 20x20. 20x20 with
    no real scheduling fits.

    I tried displaying progress as an on-screen bitmap, but that just
    appeared noisy. I have been working on an SCP so may be able to use
    that to manage the scheduling with some custom logic.




    Does 20x20 mean 400 cores with something like 80-bit initial values
    and 128-bit processing?
    Yes. 128-bit processing.

    If you didn't make further mistakes then it should be rather big,
    order of 300K LEs. It would not fit in the biggest low-end Altera
    device (Cyclone 10GX 10CX220).

    IIRC it is about 150k LUTs. It uses the simplest approach, which is
    not very fast iteration wise, but it allows a lot of cores and a fast
    clock cycle time. One core is very small and has a fast iteration (1
    clock cycle). But it takes a lot more iterations than can be done
    with a more brainiac approach.

    Using a more brainiac approach would likely cut the performance in
    half and use a lot more LUTs.


    According to my experiments, combining 2 steps has a very small
    negative effect on achievable clock and increases area by ~1.8x. So,
    to me it looks like a win.
    That's the combined step I am talking about:
    x = n/4;
    switch (n % 4) {
    case 0: n = x; break;
    case 1: n = 3*x+1; break;
    case 2: n = 3*x+2; break;
    case 3: n = 9*x+8; break;
    }



    On the Xilinx side, it would not fit in any Spartan and in most
    Artix devices. Maybe it would fit in the biggest Artix
    UltraScale+ (AU25P). But more realistically for 400 test units one
    would want Arria 10 on Altera side or Kintex on Xilinx side. Both
    are not prohibitively expensive, but not cheap.

    What device are you using?
    I am not using a particularly inexpensive device. I have a Kintex-325
    with about 200k LUTs.


    According to Xilinx marketing materials XC7K325T has 326080 equivalents
    of traditional LEs.
    That explains why your design fits.

    Pay attention that while Kintex evaluation boards can be
    inexpensive, right now buying devices for actual production work is
    rather costly, especially in small quantities.
    The lowest price I see on the digikey site is 1147 USD. That's a bad
    value relative to many other FPGAs.


    If it is smaller than Kintex KU5P then I'd strongly suspect that you
    didn't clean out all mistakes.



    There could be more mistakes for sure. But I am sure the lowest level
    core works. It seemed to in simulation. I made a slight mod to it
    since I tested it last, checking n < startn.
    Counts are not used so the logic would be stripped out, but it's good
    for verification.

    1 Core is:

    module Collatz_conjecture(rst, clk, startn, count, done);
    input rst;
    input clk;
    input [127:0] startn;
    output reg [127:0] count;
    output reg done;

    localparam TRUE  = 1'b1;
    localparam FALSE = 1'b0;

    reg [127:0] n;

    always_ff @(posedge clk)
    if (rst) begin
      n <= startn;
      count <= 0;
      done <= FALSE;
    end
    else begin
      if (!done) begin
        // even: halve; odd: 3*n+1
        if (~n[0])
          n <= n >> 1;
        else
          n <= n + n + n + 1;
        // finished once the value drops below the starting value
        if (n < startn)
          done <= TRUE;
        count <= count + 1;
      end
    end

    endmodule



    Ok.
    1. Your core misses one critical part - an overflow check.
    Assuming 80-bit initial values, the 128-bit register will overflow
    eventually. It would happen very rarely, but it would happen.
    The check for overflow does not have to be precise. A conservative
    approach, like looking at the 2 MS bits and one LS bit and signaling
    overflow not when it really happened but when it *could* happen, i.e.
    when the MS bits > 0 and the LS bit = '1', is good enough. But the
    check can't be omitted.

    2. I see that instead of calculating n = (n*3+1)/2 as one step you
    are doing it in two separate steps. I am not sure that it really
    makes your core smaller or faster. And it certainly increases the #
    of iterations by 1/3rd.

    3.
    A 128-bit count is a pointless loss of resources. 12 or 13 bits are
    sufficient. If the count exceeded that size then you already found
    something of high interest and it is the job of the supervisor to
    find out what exactly had happened.
    A 128-bit counter is actually not just too big, it is also pointless,
    because if we pretend to believe that our purpose is to disprove
    Collatz's conjecture then the supervisor would have to implement a
    stop condition with some sort of timeout that will effectively
    duplicate the functionality of the counter.

    4.
    Your module does not latch the value of startn. That means the caller
    module has to do it on its behalf, which is the most likely place where
    one can make a mistake.

    5.
    A coding style comment. Normally, the name rst is used for an
    asynchronous reset signal. In your design rst is synchronous. For the
    benefit of the casual reader I suggest finding a different name.


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sat Feb 21 13:48:02 2026
    From Newsgroup: comp.arch

    On 2/15/2026 5:50 AM, David Brown wrote:
    On 14/02/2026 22:22, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 14/02/2026 19:48, MitchAlsup wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Some people are throwing large amounts of computing power at trying
    to disprove the Collatz conjecture. Graphics cards have sped this
    up enormously.

    (A reminder, if any is needed: From a starting number n, the
    transformation, recursively applied,

        f(n) = 3*n+1, if n is odd; = n/2 if n is even.

    is conjectured to always reach 1 for any positive integer n).

    While n/2 remains even, n/2 is guaranteed to converge to 1.
    So, only the odd cases are interesting.

    3*n+1 is always even (since n was odd, and odd*odd is odd and
    odd+1 is even). So, a double iteration is (3*n+1)/2.

    Yes, that is often a micro-optimisation used.

    All the work is done on general-purpose hardware, which has many
    capabilities that are not needed, and consumes area and power that
    special-purpose hardware would not need. Also, the hardware
    is limited to 64-bit integers, and the range of tested numbers
    is now up to 2^71, so

    Induction would say it has been proven by then.

    Would you like to re-think that? "Induction" says nothing of the sort.

    The conjecture has been proven true for all numbers up to 2 ^ 71
    (assuming that is the current figure) - it has most definitely /not/
    been proven to be true for all numbers. There are other things proven
    about it, such as the asymptotic rarity of numbers that are exceptions
    to the pattern, but no one has any notion of how to prove it in general.

    By proving it up to 2^71, you simultaneously prove that it cannot be
    proven by computer-like arithmetic processes. What if the number was
    2^71,000,000 - you still have not proven it. Thereby, it is not
    provable by simple computer arithmetic processes.


    All that has been proven is that if the conjecture is not true, then the
    smallest exception to the pattern is greater than 2 ^ 71. Maybe it is
    2 ^ 71 + 1. Current computer technology could take the search a lot
    further than just 2 ^ 71 - if someone wanted to pay for it! Future
    computer technology could perhaps greatly increase the bounds. Maybe
    Thomas's "useless machine" could lead to far greater bounds without
    excessive costs - and maybe research into that could lead to useful new
    technology or techniques with applications beyond an interesting maths
    problem.

    There are lots of examples of conjectures where the first exception is a
    high number, and all testing of low numbers fitted the conjectures.
    Sometimes such counter-examples are far beyond the reach of any
    brute-force computational testing, but sometimes not.

    We have not proven that the Collatz Conjecture can be disproven by
    brute-force testing - we have merely shown that doing so would be
    difficult, and seems unlikely to be successful. And we have known all
    along that computational methods can never prove a conjecture like this
    to be true - that much should be obvious.

    Testing conjectures like this can be helpful to mathematicians.
    Sometimes they can show patterns that can be used to get more insights -
    maybe there is a pattern in the numbers that lead to peaks of growth of
    the Collatz function before dropping towards 1. Such patterns can
    inspire proofs - or more targeted searches for exceptions.

    Ergo, you need a different means.

    You don't imagine that the only thing mathematicians have done with the Collatz Conjecture is test it to large numbers, do you?


    In theory, large numbers could disprove it, if an exception were found.
    However, testing fails to prove it either: if no counter-example is
    found, that doesn't prove anything.


    Being true up to 2 ^ 71 makes it reasonable to believe it is always
    true, but that is very different from being /proven/. There are plenty
    of mathematical conjectures that are true up to very large numbers, but
    fail later on.

    You cannot even /prove/ it using BigNum computer arithmetic in finite
    time.

    Surely everyone who has done high-school mathematics already realises
    that a conjecture like this cannot be proven by trial and error alone?
    Brute-force testing is about looking for exceptions to /disprove/ the
    conjecture, or gaining insights that can lead to its proof or disproof.


    Yeah, pretty much...

    My own thinking (mostly since looking at this thread, earlier today) has
    led me to realize first that the probability of any number containing a
    divergent path is less than the reciprocal of the value (related to a
    prior post in this thread).

    Then, I further realized that one doesn't just need *one* number with
    a divergent path, one needs an infinite series of them (since any
    non-divergent path immediately collapses back into a convergent path).

    Effectively, one has to multiply an infinite series of near-zero
    probabilities; for all values of n, the effective probability converges
    to 1/Inf, or 0...


    Probably doesn't count as a sufficient proof by standards of
    mathematical rigor, but at least for me, I can say that it seems true
    enough that it always converges (and will always continue to converge
    no matter how big the numbers get).

    While in some hand-wavy sense one could try to push the probability to
    infinity to defeat the infinite reciprocal, infinity does not exist
    within the set of positive integers...


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sat Feb 21 15:39:42 2026
    From Newsgroup: comp.arch

    Robert Finch wrote:
    On 2026-02-19 5:10 a.m., Michael S wrote:

    If it is smaller than Kintex KU5P then I'd strongly suspect that you
    didn't clean out all mistakes.



    There could be more mistakes for sure. But I am sure the lowest level
    core works. It seemed to in simulation. I made a slight mod to it since
    I tested it last, checking n < startn.
    Counts are not used, so that logic would be stripped out, but it's good
    for verification.

    1 Core is:

    module Collatz_conjecture(rst, clk, startn, count, done);
    input rst;
    input clk;
    input [127:0] startn;
    output reg [127:0] count;
    output reg done;
    reg [127:0] n;

    always_ff @(posedge clk)
    if (rst) begin
        n <= startn;
        count <= 0;
        done <= FALSE;
    end
    else begin
        if (!done) begin
            if (~n[0])
                n <= n >> 2'd1;
            else
                n <= n+n+n+1;
            if (n < startn)
                done <= TRUE;
            count <= count + 1;
        end
    end

    endmodule

    I have a minimal knowledge of Verilog and may be wrong but...

    - it looks to me in the code as written that startn, count and done args
    are coded either input or output, but are actually both read and written.
    The compiler should have given an error as that implies multiple drivers
    for those wires (the caller module and the assignments).
    Possibly it didn't detect an error because it defaults to data type
    "logic" which has 4 states, one of which is tri-state 'z'.

    Maybe setting the data types to 'bit' would allow it to detect the error.

    Changing them to "inout" would turn them all into tri-state buses which
    is probably not what you want.
    Possibly have separate args for input and output:

    input bit [127:0] iCount;
    output bit [127:0] oCount;

    - I don't know how smart your compiler is but the >> shift operator can
    be replaced by a bit slice operation that only routes wires:
    if (~n[0])
    n <= {1'b0, n[127:1]};

    - it can eliminate a 128 bit adder by shifting the wires left:
    else
    n <= {n[126:0], 1'b0} + n + 1;




    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sat Feb 21 21:40:26 2026
    From Newsgroup: comp.arch

    BGB wrote:
    My own thinking (mostly since looking at this thread, earlier today) has
    led me to realize first that the probability of any number containing a
    divergent path is less than the reciprocal of the value (related to a
    prior post in this thread).

    Then, I further realized that one doesn't just need *one* number with
    a divergent path, one needs an infinite series of them (since any
    non-divergent path immediately collapses back into a convergent path).
    You do have to detect an infinite loop, but that's easy to do:
    Even if you don't employ the classic one-step/two-step approach, simply
    stopping after 8k iterations and then letting the supervisor check it
    properly would be fine.
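    A minimal sketch of that division of labor, in C (`check_with_cap` and
    the particular cap value are names and choices invented here, not from
    the thread): the fast path gives up after a fixed number of iterations
    and escalates the case to a supervisor instead of looping forever.

    ```c
    #include <stdint.h>

    /* Result of the fast checker: either the value dropped below its
     * starting point (all smaller values were already verified), or we
     * hit the iteration cap and the case is handed to a supervisor for
     * a proper big-integer, cycle-detecting check. */
    enum result { CONVERGED, NEEDS_SUPERVISOR };

    static enum result check_with_cap(uint64_t startn, unsigned cap)
    {
        uint64_t n = startn;
        for (unsigned i = 0; i < cap; i++) {
            n = (n & 1) ? 3*n + 1 : n >> 1;   /* one Collatz step */
            if (n < startn)
                return CONVERGED;
        }
        return NEEDS_SUPERVISOR;   /* cap hit: escalate, don't loop */
    }
    ```

    The cap only needs to be generous enough that escalations are rare;
    anything the fast path gives up on is re-checked properly anyway, so
    the cap cannot cause a counter-example to be missed.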
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sat Feb 21 15:17:54 2026
    From Newsgroup: comp.arch

    On 2/21/2026 2:40 PM, Terje Mathisen wrote:
    You do have to detect an infinite loop, but that's easy to do:
    Even if you don't employ the classic one-step/two-step approach, simply
    stopping after 8k iterations and then letting the supervisor check it
    properly would be fine.


    Yeah, assuming it is being tested.

    Another option could be to detect an overflow from whatever size of
    integer is used, assuming the starting point is not within a region
    where near-immediate overflow is likely.


    Seemingly the definitions on Wikipedia don't appear to exclude n==0,
    which would lead to an infinite loop: 0 is even, and 0/2==0. One can
    safely assume 0 is excluded, both because including it would make the
    conjecture trivially false and because they say "positive integer",
    which excludes 0 (unlike "natural number", which in some conventions
    includes it).

    But, yeah, it would be false if the input set included 0.






    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Feb 22 03:16:58 2026
    From Newsgroup: comp.arch

    On Sat, 21 Feb 2026 15:39:42 -0500
    EricP <ThatWouldBeTelling@thevillage.com> wrote:

    Robert Finch wrote:
    On 2026-02-19 5:10 a.m., Michael S wrote:

    If it is smaller than Kintex KU5P then I'd strongly suspect that
    you didn't clean out all mistakes.



    There could be more mistakes for sure. But I am sure the lowest
    level core works. It seemed to in simulation. I made a slight mod
    to it since I tested it last, checking n < startn.
    Counts are not used so the logic would be stripped out, but its
    good for verification.

    1 Core is:

    module Collatz_conjecture(rst, clk, startn, count, done);
    input rst;
    input clk;
    input [127:0] startn;
    output reg [127:0] count;
    output reg done;
    reg [127:0] n;

    always_ff @(posedge clk)
    if (rst) begin
        n <= startn;
        count <= 0;
        done <= FALSE;
    end
    else begin
        if (!done) begin
            if (~n[0])
                n <= n >> 2'd1;
            else
                n <= n+n+n+1;
            if (n < startn)
                done <= TRUE;
            count <= count + 1;
        end
    end

    endmodule

    I have a minimal knowledge of Verilog and may be wrong but...

    I am not a Verilog expert either (I use VHDL exclusively), but to me the
    code of Robert Finch looks OK from the HDL perspective (functionality
    is something else; I addressed it in my earlier reply).
    His Verilog appears to be an exact equivalent of the VHDL code below,
    which is perfectly legal synthesizable code under VHDL-2008 rules.


    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity Collatz_conjecture is
      port (
        rst    : in  boolean;
        clk    : in  std_logic;
        startn : in  unsigned(127 downto 0);
        count  : out unsigned(127 downto 0);
        done   : out boolean
      );
    end entity Collatz_conjecture;

    architecture a of Collatz_conjecture is
      signal n : unsigned(127 downto 0);
    begin
      process (clk)
      begin
        if rising_edge(clk) then
          if rst then
            n     <= startn;
            count <= (others => '0');
            done  <= false;
          elsif not done then
            if n(0)='0' then
              n <= '0' & n(n'high downto 1);
            else
              n <= n+n+n+1;
            end if;

            if n < startn then
              done <= true;
            end if;

            count <= count + 1;
          end if;
        end if;
      end process;
    end architecture a;


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Feb 22 10:46:00 2026
    From Newsgroup: comp.arch

    On Sat, 21 Feb 2026 20:36:51 +0200
    Michael S <already5chosen@yahoo.com> wrote:


    Using a more brainiac approach would likely cut the performance in
    half and use a lot more LUTs.


    According to my experiments, combining 2 steps has a very small
    negative effect on achievable clock and increases area by ~1.8x. So,
    to me it looks like a win.
    This is the combined step I am talking about:

    x = n/4;
    switch (x % 4) {
    case 0: n = x; break;
    case 1: n = 3*x+1; break;
    case 2: n = 3*x+2; break;
    case 3: n = 9*x+8; break;
    }




    Mistake in C above. Should be
    switch (n % 4) {
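    With that correction, the combined map can be checked against two
    applications of the usual shortcut step (n -> n/2 for even n, and
    n -> (3n+1)/2 for odd n). Note that for n = 4x+2 the two-step result
    works out to 3x+2: the value halves to 2x+1, and (3*(2x+1)+1)/2 = 3x+2.
    In this sketch, `two_steps` and `shortcut_step` are names invented
    here, not from the thread:

    ```c
    #include <stdint.h>

    /* One "shortcut" Collatz step: n/2 for even n, (3n+1)/2 for odd n
     * (3n+1 is always even when n is odd, so the halving is free). */
    static uint64_t shortcut_step(uint64_t n)
    {
        return (n & 1) ? (3*n + 1) / 2 : n / 2;
    }

    /* The combined two-step map, switching on the two low bits of n. */
    static uint64_t two_steps(uint64_t n)
    {
        uint64_t x = n / 4;
        switch (n % 4) {
        case 0:  return x;          /* even, even: two halvings */
        case 1:  return 3*x + 1;    /* odd, then even */
        case 2:  return 3*x + 2;    /* even, then odd */
        default: return 9*x + 8;    /* odd, then odd  */
        }
    }
    ```

    Checking two_steps(n) == shortcut_step(shortcut_step(n)) over a range
    of n confirms each case of the table.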

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Feb 22 04:50:11 2026
    From Newsgroup: comp.arch

    On 2/21/2026 3:17 PM, BGB wrote:
    Yeah, assuming it is being tested.

    Another option could be to detect an overflow from whatever size of
    integer is used, assuming the starting point is not within a region
    where near-immediate overflow is likely.

    Seemingly the definitions on Wikipedia don't appear to exclude n==0,
    which would lead to an infinite loop: 0 is even, and 0/2==0. One can
    safely assume 0 is excluded, both because including it would make the
    conjecture trivially false and because they say "positive integer",
    which excludes 0 (unlike "natural number", which in some conventions
    includes it).

    But, yeah, it would be false if the input set included 0.


    I realized a detail that I missed earlier:
    What you were likely trying to point out was not whether cases exist
    which shoot towards +Inf, but rather whether one can prove that no
    paths exist which loop back on themselves.

    I guess the latter is harder to prove.


    I can see something visually:
    In this case, the number line can be seen as a spiral along a cone,
    where each turn represents a power of 2 (each turn has twice the
    circumference of the prior turn).

    So, in this case, dividing by 2 drops down 1 turn (or, if it reaches
    the point of the cone, it reaches 1).

    3*n+1 effectively moves 1.5+1 turns up, but then immediately drops down
    1 turn, so in effect it moves by 0.5 turn, with the +1 moving it to the
    next position within this turn.

    This means that for a cycle to exist, there would need to exist a turn
    where the loop proceeds through every position within that turn. But
    this in turn can only happen if the next level up advances through
    every position in that level (forming a zigzag pattern across the
    levels), with this pattern extending towards +Inf. That pattern could
    only exist if *no* paths converge, so in effect such a path does not
    exist.


    Or, in effect, since it is achieving a 1/2 turn + 1 in phase space,
    there will not exist any closed loops.

    Essentially it is the same basic reason why no two adjacent spots can
    collide on the same index within modulo indexing.


    In effect, the whole space turns into non-intersecting spiral paths
    headed toward 1 (and, when paths merge, all roads lead to 1).


    Probably doesn't really count as a proof though...








    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sun Feb 22 12:45:44 2026
    From Newsgroup: comp.arch

    Michael S wrote:
    On Sat, 21 Feb 2026 20:36:51 +0200
    Michael S <already5chosen@yahoo.com> wrote:


    Using a more brainiac approach would likely cut the performance in
    half and use a lot more LUTs.


    According to my experiments, combining 2 steps has a very small
    negative effect on achievable clock and increases area by ~1.8x. So,
    to me it looks like a win.
    That's the combined step I am talking about:
    x = n/4;
    switch (x % 4) {
    case 0: n = x; break;
    case 1: n = 3*x+1; break;
    case 2: n = 2*x+2; break;
    case 3: n = 9*x+8; break;
    }




    Mistake in C above. Should be
    switch (n % 4) {

    I did notice that. :-)

    The one thing that worries me (sw on a 64-bit platform) about the code
    is the 9* on 128-bit variables:

    9*x+8 =>

    Do we use SHLD + SHL here or something else?

    How about MUL & LEA?

    ;; Input in r10:r9, output in rdx:rax
    mov rax,r9
    mul rbx ;; RBX == 9, RDX:RAX = RAX*9
    lea r10,[r10+r10*8]
    add rdx,r10

    That looks like 5-6 clock cycles, so the branch misses from the switch statement would probably dominate unless you do as I suggested and use
    lookup tables instead:

    let bot2 = n & 3;
    let x = n >> 2;
    n = x*multab[bot2] + addtab[bot2];

    but if we do that, then (at least for a sw implementation) it would be
    better to pick a lot more of the LS bits, at least 8-12?

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Sun Feb 22 14:43:11 2026
    From Newsgroup: comp.arch

    On 21/02/2026 20:48, BGB wrote:
    On 2/15/2026 5:50 AM, David Brown wrote:
    On 14/02/2026 22:22, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 14/02/2026 19:48, MitchAlsup wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Some people are throwing large amounts of computing power at trying
    to disprove the Collatz conjecture.  Graphics cards have sped this
    up enormously.

    (A reminder, if any is needed:  From a starting number n, the
    transformation, recursively applied,


         f(n) = 3*n+1, if n is odd; = n/2 if n is even.

    is conjectured to always reach 1 for any positive integer n).

    While n/2 remains even, n/2 is guaranteed to converge to 1.
    So, only the odd cases are interesting.

    3*n+1 is always even (since n was odd, and odd*odd is odd and
    odd+1 is even). So, a double iteration is (3*n+1)/2.

    Yes, that is often a micro-optimisation used.

    All the work is done on general-purpose hardware, which has many
    capabilities that are not needed, and consume area and power that
    special-purpose hardware would not need.  Also, the hardware
    is limited to 64-bit integers, and the range of tested numbers
    is now up to 2^71, so

    Induction would say it has been proven by then.

    Would you like to re-think that?  "Induction" says nothing of the sort.
    The conjecture has been proven true for all numbers up to 2 ^ 71
    (assuming that is the current figure) - it has most definitely /not/
    been proven to be true for all numbers.  There are other things proven
    about it, such as the asymptotic rarity of numbers that are exceptions
    to the pattern, but no one has any notion of how to prove it in
    general.

    By proving up to 2^71, you simultaneously prove that it cannot be proven
    by computer-like arithmetic processes. What if the number was
    2^71,000,000? You still have not proven it. Thereby, it is not provable
    by simple computer arithmetic processes.


    All that has been proven is that if the conjecture is not true, then
    the smallest exception to the pattern is greater than 2 ^ 71.  Maybe
    it is 2 ^ 71 + 1.  Current computer technology could take the search a
    lot further than just 2 ^ 71 - if someone wanted to pay for it!
    Future computer technology could perhaps greatly increase the bounds.
    Maybe Thomas's "useless machine" could lead to far greater bounds
    without excessive costs - and maybe research into that could lead to
    useful new technology or techniques with applications beyond an
    interesting maths problem.

    There are lots of examples of conjectures where the first exception is
    a high number, and all testing of low numbers fitted the conjectures.
    Sometimes such counter-examples are far beyond the reach of any brute-
    force computational testing, but sometimes not.

    We have not proven that the Collatz Conjecture can be disproven by
    brute-force testing - we have merely shown that doing so would be
    difficult, and seems unlikely to be successful.  And we have known all
    along that computational methods can never prove a conjecture like
    this to be true - that much should be obvious.

    Testing conjectures like this can be helpful to mathematicians.
    Sometimes they can show patterns that can be used to get more insights
    - maybe there is a pattern in the numbers that lead to peaks of growth
    of the Collatz function before dropping towards 1.  Such patterns can
    inspire proofs - or more targeted searches for exceptions.

    Ergo, you need a different means.

    You don't imagine that the only thing mathematicians have done with
    the Collatz Conjecture is test it to large numbers, do you?


    In theory, large numbers could disprove it, if an exception were found. However, it fails to prove it either; and if no value is found (or would
    be found) it doesn't prove anything.


    Being true up to 2 ^ 71 makes it reasonable to believe it is always
    true, but that is very different from being /proven/.  There are plenty
    of mathematical conjectures that are true up to very large numbers, but
    fail later on.

    You cannot even /prove/ it using BigNum computer arithmetic in
    finite time.

    Surely everyone who has done high-school mathematics already realises
    that a conjecture like this cannot be proven by trial and error alone?
    Brute-force testing is about looking for exceptions to /disprove/ the
    conjecture, or gaining insights that can lead to its proof or disproof.


    Yeah, pretty much...

    My own thinking (mostly since looking at this thread, earlier today) has
    led me to realize first that the probability of any number containing a divergent path is less than the reciprocal of the value (related to
    prior post in this thread).


    Mathematicians don't consider someone's gut feelings as proof. At best,
    it can be an inspiration about what to focus on. Random guesses about
    what someone thinks is the probability of something are rarely of any
    use to anyone.

    It does not even make sense to talk about "probability" here - a number
    either leads to a divergent path, a convergent path ending in 1, or a different cycle. It is not a random process.

    You can make statements about the density of numbers that do not lead to
    1, or limits to the density. There has been a fair amount of work done
    on the Collatz Conjecture in this way, putting limits on the proportion
    of numbers that are divergent or lead to other cycles (it turns out they
    are very rare, if they exist at all). But those limits do not come from guesses, they come from proofs.

    Then, further realized that, since one doesn't just need *one* number
    with the probability of a divergent path, one needs an infinite series (since any non-divergent path immediately collapses back into a
    convergent path).

    You can't prove that a number leads to a divergent path by trial and
    error, even if you find such a number. But if someone found a number
    that appeared to lead to a divergent path, it may be possible to prove
    that it diverges by other methods. If alternative cycles exist, then
    they could be found by trial and error (though no one is hopeful that
    this will be practically possible).


    Effectively, one has to multiply an infinite series of near 0
    probabilities. Or, effectively, for all values of n, the effective probability of a non-zero probability converges to 1/Inf, or, 0...


    Such arguments are invariably nonsense unless you qualify them far
    better - especially when "probability" has no meaning here.


    Probably doesn't count as a sufficient proof by standards of
    mathematical rigor, but at least for me, I can say that it seems true
    enough that it always converges (and, will always continue to converge
    no matter how big the numbers get).


    You are correct that it doesn't count as proof.

    While in some hand wavy sense, one could try to push the probability to infinity to defeat the infinite reciprocal, infinity does not exist
    within the set of positive integers...



    Let me give another argument in the same line as yours - with the same
    level of "hand-wavy sense" and the same level of completely invalid mathematics.

    With a ruler (unmarked) and compass, you can construct lots of things by algorithmic sequences. You can, for example, bisect an arbitrary angle.
    You can construct an exact 65537-sided polygon.  So if we take a
    random sequence of N steps, maybe it can be used to trisect an arbitrary angle. It's unlikely, but it is obvious that it is plausible. So we
    can say that there is a probability "p(N)" that any sequence of length N
    will trisect an arbitrary angle. No one has found such a trisector
    algorithm yet, so we know p(N) is small. However, it is also obvious
    that as the number of steps grows, so does this probability (always
    limited by 1).

    So now consider q(N) = 1 - p(N), the probability that no sequence of
    length N can trisect an angle.  As we have reasoned, q(N) will start at 1
    (you can't trisect an angle in one step), but will drop below 1 and
    decrease. The probability that /no/ sequence of steps will be able to
    trisect an angle is thus the product of all q(N). It's a product of a decreasing sequence of numbers between 1 and 0 - clearly it converges to
    0. Thus at some point, there has to be a sequence of steps that can
    trisect an angle - we have proved its existence, even though we have not figured out how to do it.


    The problem with this argument is that it is bollocks. Real mathematics
    has proved that trisection of arbitrary angles by ruler and compass is impossible.

    We don't yet know if the Collatz Conjecture is true or not. But your "probability argument" is of no more worth there than it is for
    trisecting angles.





    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Sun Feb 22 15:14:31 2026
    From Newsgroup: comp.arch

    On 21/02/2026 19:27, BGB wrote:
    On 2/14/2026 1:18 PM, David Brown wrote:
    On 14/02/2026 19:48, MitchAlsup wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Some people are throwing large amounts of computing power at trying
    to disprove the Collatz conjecture.  Graphics cards have sped this
    up enormously.

    (A reminder, if any is needed:  From a starting number n, the
    transformation, recursively applied,


         f(n) = 3*n+1, if n is odd; = n/2 if n is even.

    is conjectured to always reach 1 for any positive integer n).

    While n/2 remains even, n/2 is guaranteed to converge to 1.
    So, only the odd cases are interesting.

    3*n+1 is always even (since n was odd, and odd*odd is odd and
    odd+1 is even). So, a double iteration is (3*n+1)/2.

    Yes, that is often a micro-optimisation used.

    All the work is done on general-purpose hardware, which has many
    capabilities that are not needed, and consume area and power that
    special-purpose hardware would not need.  Also, the hardware
    is limited to 64-bit integers, and the range of tested numbers
    is now up to 2^71, so

    Induction would say it has been proven by then.

    Would you like to re-think that?  "Induction" says nothing of the sort.

    The conjecture has been proven true for all numbers up to 2 ^ 71
    (assuming that is the current figure) - it has most definitely /not/
    been proven to be true for all numbers.  There are other things proven
    about it, such as the asymptotic rarity of numbers that are exceptions
    to the pattern, but no one has any notion of how to prove it in general.

    Being true up to 2 ^ 71 makes it reasonable to believe it is always
    true, but that is very different from being /proven/.  There are
    plenty of mathematical conjectures that are true up to very large
    numbers, but fail later on.


    Comments?

    Interesting problem,
    waste of power.

    It is a mathematically very interesting problem.  Whether there is
    anything to be gained by continuing the search for exceptions by brute
    force, is a different matter - but /if/ it is worth doing, then doing
    so more efficiently is a good idea.

    FWIW:
    Is there any number likely to disprove it? No.

    What justification do you have for that claim?

    There's a big difference between saying that tests on sample numbers
    have not given any sign of a way to find exceptions to the conjecture,
    and talking about what is "likely" or not.

    If a number is found that has a different cycle, or a number is found
    that can be proven to lead to a divergent path, then that would disprove
    the conjecture.


    Would running it out to some arbitrary precision prove it? No.

    Correct.


    Granted, finding a value where it failed would disprove it, but this is unlikely to happen.


    Again, you can't talk about "likely" or "unlikely" - you are giving quantitative (albeit vague) probabilities to something that is not
    random, so you should be careful about what you mean. There /have/ been proofs made about limits to the densities of possible exceptions, but
    that only gives you limits to the probability of finding an exception by arbitrarily picking a number within a given range and testing it.

    But, can note:
      For all positive even numbers, n/2 is less than n;
      For all positive odd numbers, n*3+1 is even;
        So, it becomes (n*3+1)/2
      For all positive even numbers:
        Probability of n/2 being an integer is 100%.
        Probability of n/2 being even is 50%.
      For (n*3+1)/2:
        Probability of this being even is 50%.
          So, 50% value gets bigger, 50% it gets smaller.
        Probability of a double odd: 25%
        Probability of a triple odd: 12.5%
        ...

    OK, so it would seem like the probability, for N, of finding a path
    that diverges becomes 1/2^k, where k=log2(n).


    No. You are mixing statistical patterns and probabilities. The
    statistics are that most people who play in a lottery, lose out - but
    the probability that someone wins is high. And once the lottery is
    drawn, probabilities make no sense for individuals - either they won, or
    they lost. (The "Collatz lottery" has already been drawn, even though
    we don't know the results.)

    Your argument for the statistics is fair, and can be seen as a visual
    pattern if you draw a graph of the path length against the starting
    value. It has been proven that this pattern continues - there is a
    bound to the density of exceptions that are outside that pattern. But
    that does not tell you anything about probabilities of the path-length
    or cycle for any given case.

    Where, the paths that get smaller can't include a path that gets bigger unless the current path itself can get bigger, where the probability of
    this happening quickly approaches 0.

    If there were any place likely to find a divergence, it would be for
    small integers; this is empirically known to be false.


    Granted, saying that the probability approaches 0 is not the same as
    saying that the probability is 0.

    The probability of it being 0 seems directly proportional to the
    reciprocal:
    -a If p>(1/n): A divergence would be expected.
    -a If p<(1/n): A divergence would be impossible.
    -a But, p=1/n, as (1/n)=(1/(2**log2(n)))

    Though, since n*3+1 is always even, it is p=1/(2**ceil(log2(n))).

    1/(2**ceil(log2(n))) < (1/n) for all non-even integers (would only be
    equal for powers of 2, but powers of 2 always converge).

    So, it would seem like probability of divergence converges on 0.

    Dunno, though, this is just how it seems to me at the moment.



    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Feb 22 16:35:02 2026
    From Newsgroup: comp.arch

    On Sun, 22 Feb 2026 12:45:44 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Sat, 21 Feb 2026 20:36:51 +0200
    Michael S <already5chosen@yahoo.com> wrote:


    Using a more brainiac approach would likely cut the performance in
    half and use a lot more LUTs.


    According to my experiments, combining 2 steps has a very small
    negative effect on achievable clock and increases area by ~1.8x. So,
    to me it looks like a win.
    That's the combined step I am talking about:
    x = n/4;
    switch (x % 4) {
    case 0: n = x; break;
    case 1: n = 3*x+1; break;
    case 2: n = 2*x+2; break;
    case 3: n = 9*x+8; break;
    }




    Mistake in C above. Should be
    switch (n % 4) {

    I did notice that. :-)

    The one thing that worries me (sw on a 64-bit platform) about the
    code is the 9* on 128-bit variables:

    9*x+8 =>

    Do we use SHLD + SHL here or something else?

    How about MUL & LEA?

    ;; Input in r10:r9, output in rdx:rax
    mov rax,r9
    mul rbx ;; RBX == 9, RDX:RAX = RAX*9
    lea r10,[r10+r10*8]
    add rdx,r10

    That looks like 5-6 clock cycles, so the branch misses from the
    switch statement would probably dominate unless you do as I suggested
    and use lookup tables instead:

    let bot2 = n & 3;
    let x = n >> 2;
    n = x*multab[bot2] + addtab[bot2];

    but if we do that, then (at least for a sw implementation) it would
    be better to pick a lot more of the LS bits, at least 8-12?

    Terje


    This variant is not intended for software implementation.
    For software it is obviously sub-optimal for the reasons you suggested.
    OTOH, for FPGA it seems to work a little better than both simpler and
    more complicated steps.

    I presented the code in C only because C is lingua franca of
    comp.arch.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Sun Feb 22 11:41:34 2026
    From Newsgroup: comp.arch

    It is a mathematically very interesting problem. Whether there is anything to be gained by continuing the search for exceptions by brute force, is
    a different matter - but /if/ it is worth doing, then doing so more efficiently is a good idea.

    AFAIK, in such games "more efficiently" tends to be largely irrelevant
    if it's more efficient by a (non-astronomical) constant factor.
    Instead, you need it to be algorithmically more efficient.


    === Stefan
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Feb 22 12:38:31 2026
    From Newsgroup: comp.arch

    On 2/22/2026 7:43 AM, David Brown wrote:
    On 21/02/2026 20:48, BGB wrote:
    On 2/15/2026 5:50 AM, David Brown wrote:
    On 14/02/2026 22:22, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 14/02/2026 19:48, MitchAlsup wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Some people are throwing large amounts of computing power at trying
    to disprove the Collatz conjecture.  Graphics cards have sped this
    up enormously.

    (A reminder, if any is needed:  From a starting number n, the
    transformation, recursively applied,


         f(n) = 3*n+1, if n is odd; = n/2 if n is even.

    is conjectured to always reach 1 for any positive integer n).

    While n/2 remains even, n/2 is guaranteed to converge to 1.
    So, only the odd cases are interesting.

    3*n+1 is always even (since n was odd, and odd*odd is odd and
    odd+1 is even). So, a double iteration is (3*n+1)/2.

    Yes, that is often a micro-optimisation used.

    All the work is done on general-purpose hardware, which has many
    capabilities that are not needed, and consume area and power that
    special-purpose hardware would not need.  Also, the hardware
    is limited to 64-bit integers, and the range of tested numbers
    is now up to 2^71, so

    Induction would say it has been proven by then.

    Would you like to re-think that?  "Induction" says nothing of the
    sort.

    The conjecture has been proven true for all numbers up to 2 ^ 71
    (assuming that is the current figure) - it has most definitely /not/
    been proven to be true for all numbers.  There are other things proven
    about it, such as the asymptotic rarity of numbers that are exceptions
    to the pattern, but no one has any notion of how to prove it in
    general.

    By proving up to 2^71, you simultaneously prove that it cannot be proven
    by computer-like arithmetic processes. What if the number was
    2^71,000,000? You still have not proven it. Thereby, it is not provable
    by simple computer arithmetic processes.


    All that has been proven is that if the conjecture is not true, then
    the smallest exception to the pattern is greater than 2 ^ 71.  Maybe
    it is 2 ^ 71 + 1.  Current computer technology could take the search
    a lot further than just 2 ^ 71 - if someone wanted to pay for it!
    Future computer technology could perhaps greatly increase the bounds.
    Maybe Thomas's "useless machine" could lead to far greater bounds
    without excessive costs - and maybe research into that could lead to
    useful new technology or techniques with applications beyond an
    interesting maths problem.

    There are lots of examples of conjectures where the first exception
    is a high number, and all testing of low numbers fitted the
    conjectures. Sometimes such counter-examples are far beyond the reach
    of any brute- force computational testing, but sometimes not.

    We have not proven that the Collatz Conjecture can be disproven by
    brute-force testing - we have merely shown that doing so would be
    difficult, and seems unlikely to be successful.  And we have known
    all along that computational methods can never prove a conjecture
    like this to be true - that much should be obvious.

    Testing conjectures like this can be helpful to mathematicians.
    Sometimes they can show patterns that can be used to get more
    insights - maybe there is a pattern in the numbers that lead to peaks
    of growth of the Collatz function before dropping towards 1.  Such
    patterns can inspire proofs - or more targeted searches for exceptions.

    Ergo, you need a different means.

    You don't imagine that the only thing mathematicians have done with
    the Collatz Conjecture is test it to large numbers, do you?


    In theory, large numbers could disprove it, if an exception were found.
    However, it fails to prove it either; and if no value is found (or
    would be found) it doesn't prove anything.


    Being true up to 2 ^ 71 makes it reasonable to believe it is always
    true, but that is very different from being /proven/.  There are plenty
    of mathematical conjectures that are true up to very large numbers, but
    fail later on.

    You cannot even /prove/ it using BigNum computer arithmetic in
    finite time.

    Surely everyone who has done high-school mathematics already realises
    that a conjecture like this cannot be proven by trial and error
    alone? Brute-force testing is about looking for exceptions to
    /disprove/ the conjecture, or gaining insights that can lead to its
    proof or disproof.


    Yeah, pretty much...

    My own thinking (mostly since looking at this thread, earlier today)
    has led me to realize first that the probability of any number
    containing a divergent path is less than the reciprocal of the value
    (related to prior post in this thread).


    Mathematicians don't consider someone's gut feelings as proof.  At best,
    it can be an inspiration about what to focus on.  Random guesses about
    what someone thinks is the probability of something are rarely of any
    use to anyone.

    It does not even make sense to talk about "probability" here - a number
    either leads to a divergent path, a convergent path ending in 1, or a
    different cycle.  It is not a random process.

    You can make statements about the density of numbers that do not lead to
    1, or limits to the density.  There has been a fair amount of work done
    on the Collatz Conjecture in this way, putting limits on the proportion
    of numbers that are divergent or lead to other cycles (it turns out they
    are very rare, if they exist at all).  But those limits do not come
    from guesses, they come from proofs.

    Then, further realized that, since one doesn't just need *one* number
    with the probability of a divergent path, one needs an infinite series
    (since any non-divergent path immediately collapses back into a
    convergent path).

    You can't prove that a number leads to a divergent path by trial and
    error, even if you find such a number.  But if someone found a number
    that appeared to lead to a divergent path, it may be possible to prove
    that it diverges by other methods.  If alternative cycles exist, then
    they could be found by trial and error (though no one is hopeful that
    this will be practically possible).


    Effectively, one has to multiply an infinite series of near 0
    probabilities. Or, effectively, for all values of n, the effective
    probability of a non-zero probability converges to 1/Inf, or, 0...


    Such arguments are invariably nonsense unless you qualify them far
    better - especially when "probability" has no meaning here.


    Probably doesn't count as a sufficient proof by standards of
    mathematical rigor, but at least for me, I can say that it seems true
    enough that it always converges (and, will always continue to converge
    no matter how big the numbers get).


    You are correct that it doesn't count as proof.

    While in some hand wavy sense, one could try to push the probability
    to infinity to defeat the infinite reciprocal, infinity does not exist
    within the set of positive integers...



    Let me give another argument in the same line as yours - with the same
    level of "hand-wavy sense" and the same level of completely invalid mathematics.

    With a ruler (unmarked) and compass, you can construct lots of things by
    algorithmic sequences.  You can, for example, bisect an arbitrary angle.
    You can construct an exact 65537-sided polygon.  So if we take a
    random sequence of N steps, maybe it can be used to trisect an arbitrary
    angle.  It's unlikely, but it is obvious that it is plausible.  So we
    can say that there is a probability "p(N)" that any sequence of length N
    will trisect an arbitrary angle.  No one has found such a trisector
    algorithm yet, so we know p(N) is small.  However, it is also obvious
    that as the number of steps grows, so does this probability (always
    limited by 1).

    So now consider q(N) = 1 - p(N), the probability that no sequence of
    length N can trisect an angle.  As we have reasoned, q(N) will start at 1
    (you can't trisect an angle in one step), but will drop below 1 and
    decrease.  The probability that /no/ sequence of steps will be able to
    trisect an angle is thus the product of all q(N).  It's a product of a
    decreasing sequence of numbers between 1 and 0 - clearly it converges to
    0.  Thus at some point, there has to be a sequence of steps that can
    trisect an angle - we have proved its existence, even though we have not
    figured out how to do it.


    The problem with this argument is that it is bollocks.  Real mathematics
    has proved that trisection of arbitrary angles by ruler and compass is
    impossible.

    We don't yet know if the Collatz Conjecture is true or not.  But your
    "probability argument" is of no more worth there than it is for
    trisecting angles.


    I was initially seeing it more like a very large hash table.
    But, I have since realized it is not a hash table, it is a more
    organized pattern.

    But, yeah, not a math person, and I will make no claim of having
    "proven" anything...

    Rather, my aim was mostly to try to start organizing stuff enough to
    where I could get a sense of why it works, and doesn't explode off into infinite numbers.


    But, will note that this whole area is one major reason I am not really
    into math or physics. As a practical tool, they are useful, or can be
    seen as interesting detours.

    But, proofs kinda ruin the whole experience.


    But, yeah, at least for me, now seeing it as a sort of cone of
    power-of-2 rings with all paths forming into spirals which end at 1,
    each odd step effectively going half-way around a spiral formed from the number line but then ending up at an incremented position, etc, seems
    like a way to see why it does what it does.

    Even if it doesn't count for anything, it isn't much worse than trying
    to test it empirically, which as noted, also does not count (it could
    only count if an exception could be found, but an exception won't be
    found no matter how big the search space becomes, so...).

    But, no proofs to be found here, I will not make this claim, in any case.

    ...

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sun Feb 22 14:48:13 2026
    From Newsgroup: comp.arch

    Michael S wrote:
    On Sat, 21 Feb 2026 15:39:42 -0500
    EricP <ThatWouldBeTelling@thevillage.com> wrote:

    Robert Finch wrote:
    On 2026-02-19 5:10 a.m., Michael S wrote:

    If it is smaller than Kintex KU5P then I'd strongly suspect that
    you didn't clean out all mistakes.



    There could be more mistakes for sure. But I am sure the lowest
    level core works. It seemed to in simulation. I made a slight mod
    to it since I tested it last, checking n < startn.
    Counts are not used so the logic would be stripped out, but it's
    good for verification.

    1 Core is:

    module Collatz_conjecture(rst, clk, startn, count, done);
    input rst;
    input clk;
    input [127:0] startn;
    output reg [127:0] count;
    output reg done;
    reg [127:0] n;

    always_ff @(posedge clk)
      if (rst) begin
        n <= startn;
        count <= 0;
        done <= FALSE;
      end
      else begin
        if (!done) begin
          if (~n[0])
            n <= n >> 2'd1;
          else
            n <= n+n+n+1;
          if (n < startn)
            done <= TRUE;
          count <= count + 1;
        end
      end

    endmodule
    I have a minimal knowledge of Verilog and may be wrong but...

    I am not a Verilog expert either (use VHDL exclusively), but to me the
    code of Robert Finch looks o.k from the HDL perspective (functionality
    is something else, I addressed it in my reply below).
    His Verilog appears to be an exact equivalent of VHDL code below,
    which is perfectly legal synthesizable code under VHDL-2008 rules.


    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity Collatz_conjecture is
      port (
        rst    : in  boolean;
        clk    : in  std_logic;
        startn : in  unsigned(127 downto 0);
        count  : out unsigned(127 downto 0);
        done   : out boolean
      );
    end entity Collatz_conjecture;

    architecture a of Collatz_conjecture is
      signal n : unsigned(127 downto 0);
    begin
      process (clk)
      begin
        if rising_edge(clk) then
          if rst then
            n <= startn;
            count <= (others => '0');
            done <= false;
          elsif not done then
            if n(0)='0' then
              n <= '0' & n(n'high downto 1);
            else
              n <= n+n+n+1;
            end if;

            if n < startn then
              done <= true;
            end if;

            count <= count + 1;
          end if;
        end if;
      end process;
    end architecture a;



    Well then I just don't get this at all.
    I have a VHDL textbook that explicitly states that IN ports are input only,
    OUT ports are output only, and INOUT are input and output (tri-state).

    The language spec SystemVerilog IEEE_Std-1800-2017
    (it's a horrid language - the spec is 1315 pages long,
    and so complex that no compiler implements all of it,
    and some compilers still use the pre-2005 language spec)

    says in section "23.3.3.2 Port connection rules for variables" that
    "Assignments to variables declared as input ports shall be illegal." and
    "Procedural or continuous assignments to a variable connected to the
    output port of an instance shall be illegal."

    but it also says "An output port can be connected to a net (or a
    concatenation) of a compatible data type. In this case, multiple drivers
    shall be permitted on the net"

    and in "23.3.3.1 Port coercion
    A port that is declared as input (output) but used as an output (input)
    or inout may be coerced to inout. If not coerced to inout, a warning
    shall be issued."

    but never defines what "coerced" means or how to do it.

    Verilog-2017 also has the "uwire" net type which is a uni-driver type
    that only allows a single driver for the wire.
    (But some compilers don't support uwire)

    "6.6.2 Unresolved nets
    The uwire net is an unresolved or unidriver wire and is used to model nets
    that allow only a single driver. The uwire type can be used to enforce
    this restriction. It shall be an error to connect any bit of a uwire net
    to more than one driver. It shall be an error to connect a uwire net to a
    bidirectional terminal of a bidirectional pass switch.

    The port connection rule in 23.3.3.6 enforces this restriction across
    the net hierarchy or shall issue a warning if not."

    "23.3.3.6 Single source nets (uwire)
    If the net on either side of a port has the net type uwire,
    a warning shall be issued if the nets are not merged
    into a single net, as described in 23.3.3.7."

    and the spec contradicts itself as to whether it is an error or a warning.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Feb 22 23:25:47 2026
    From Newsgroup: comp.arch

    On Sun, 22 Feb 2026 14:48:13 -0500
    EricP <ThatWouldBeTelling@thevillage.com> wrote:

    Michael S wrote:
    On Sat, 21 Feb 2026 15:39:42 -0500
    EricP <ThatWouldBeTelling@thevillage.com> wrote:

    Robert Finch wrote:
    On 2026-02-19 5:10 a.m., Michael S wrote:

    If it is smaller than Kintex KU5P then I'd strongly suspect that
    you didn't clean out all mistakes.



    There could be more mistakes for sure. But I am sure the lowest
    level core works. It seemed to in simulation. I made a slight mod
    to it since I tested it last, checking n < startn.
    Counts are not used so the logic would be stripped out, but it's
    good for verification.

    1 Core is:

    module Collatz_conjecture(rst, clk, startn, count, done);
    input rst;
    input clk;
    input [127:0] startn;
    output reg [127:0] count;
    output reg done;
    reg [127:0] n;

    always_ff @(posedge clk)
    if (rst) begin
    n <= startn;
    count <= 0;
    done <= FALSE;
    end
    else begin
    if (!done) begin
    if (~n[0])
    n <= n >> 2'd1;
    else
    n <= n+n+n+1;
    if (n < startn)
    done <= TRUE;
    count <= count + 1;
    end
    end

    endmodule
    I have a minimal knowledge of Verilog and may be wrong but...

    I am not a Verilog expert either (I use VHDL exclusively), but to me
    the code of Robert Finch looks o.k. from the HDL perspective
    (functionality is something else; I addressed it in my reply
    below). His Verilog appears to be an exact equivalent of the VHDL code
    below, which is perfectly legal synthesizable code under VHDL-2008
    rules.


    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity Collatz_conjecture is
    port (
    rst : in boolean;
    clk : in std_logic;
    startn : in unsigned(127 downto 0);
    count : out unsigned(127 downto 0);
    done : out boolean
    );
    end entity Collatz_conjecture;

    architecture a of Collatz_conjecture is
    signal n : unsigned(127 downto 0);
    begin
    process (clk)
    begin
    if rising_edge(clk) then
    if rst then
    n <= startn;
    count <= (others => '0');
    done <= false;
    elsif not done then
    if n(0)='0' then
    n <= '0' & n(n'high downto 1);
    else
    n <= n+n+n+1;
    end if;

    if n < startn then
    done <= true;
    end if;

    count <= count + 1;
    end if;
    end if;
    end process;
    end architecture a;



    Well then I just don't get this at all.
    I have a VHDL textbook that explicitly states that IN ports are input
    only, OUT ports are output only, and INOUT are input and output
    (tri-state).


    Sounds like an old textbook, not updated with the VHDL-2008 changes
    related to what is legal for 'out' ports.
    See, for example, here: https://docs.amd.com/r/en-US/ug901-vivado-synthesis/Reading-Output-Ports


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Feb 22 19:12:49 2026
    From Newsgroup: comp.arch

    On 2026-02-21 1:36 p.m., Michael S wrote:
    On Sat, 21 Feb 2026 01:56:03 -0500
    Robert Finch <robfi680@gmail.com> wrote:

    On 2026-02-19 5:10 a.m., Michael S wrote:
    On Wed, 18 Feb 2026 22:56:19 -0500
    Robert Finch <robfi680@gmail.com> wrote:

    On 2026-02-18 11:22 a.m., Michael S wrote:
    On Sun, 15 Feb 2026 10:56:51 -0500
    Robert Finch <robfi680@gmail.com> wrote:

    I have coded a version that tests up to 8192 number
    simultaneously. It is supposed to be able to run at over 200
    MHz. Some of the number would take a large number of iterations
    though.

    I found something strange, the code to count for more numbers
    does not increase the size of the core very much. (But it takes
    longer to build.) I am not sure if this occurs because there is
    a mistake somewhere, or if the tools are able to figure out how
    to share all the logic. I am thinking the logic for adders is
    shared somehow. 128-bit arithmetic is being used.


    There is a mistake in your code. No other explanation is possible.

    You should look for 3 sorts of mistakes
    1. Core itself.
    The most likely mistake is failing to look for one of the ending
    conditions. You have to look for 3 conditions:
    a) - value produced at current step is smaller than initial value.
    That is successful ending. Here you signal to the scheduler
    that you are ready to accept the next value.
    b) - overflow. Next value exceeds the width of register.
    c) - timeout. You were running for more than N steps. N=8192 looks
    like the reasonable value.
    b) and c) are suspected failures.
    The handling for b) and c) is similar - you raise an alert signal and
    stop. Your test unit has to provide supervisor circuitry with the
    means to fetch the initial value that caused the suspected failure and
    to clear the alert condition.
    After alert is cleared your unit returns to the same state as
    after success - waiting for next initial value.

    I found the mistake. Core was working, but groups of cores were
    not. Done signals were missing.


    2. Scheduler.
    In proof-of-concept design the scheduler does not have to be very
    realistic. For example, it does not have to be efficiently
    pipelined. But it should be able to monitor ready signals of all
    test units and to supply different initial data to different
    units. Supplying the same data to all units simultaneously is a
    sort of mistake that explains your observed behavior.

    There is really simple scheduling ATM. There is just a bunch of
    done signals that are gated together. Meaning the slowest
    calculation of the group will slow everything else down. Once all
    the dones are true the next set of numbers are loaded.

    3. Supervisor
    Again, in proof-of-concept design the supervisor does not have to
    be very realistic. But it should be able to selectively exercise
    all core feature related to alert, including fetching of initial
    data and clearing of alert.

    After you coded everything correctly, do not even try to compile
    test with 8000 units. That is totally unrealistic. Even if you
    find huge FPGA in which such number of testers could fit (and I
    am not sure that such FPGA exists) the compilation will take many
    hours, possibly days.



    Yeah 8,000 was definitely out of the picture. I had since coded
    24x24 matrix (576) and although the logic would fit that turned
    out not to be able to route. So, it is down to 20x20. 20x20 with
    no real scheduling fits.

    I tried displaying progress as an on-screen bitmap, but that just
    appeared noisy. I have been working on an SCP so may be able to use
    that to manage the scheduling with some custom logic.




    Does 20x20 mean 400 cores with something like 80-bit initial values
    and 128-bit processing?
    Yes. 128-bit processing.

    If you didn't make further mistakes then it should be rather big,
    order of 300K LEs. It would not fit in the biggest low-end Altera
    device (Cyclone 10GX 10CX220).

    IIRC it is about 150k LUTs. It uses the simplest approach, which is
    not very fast iteration wise, but it allows a lot of cores and a fast
    clock cycle time. One core is very small and has a fast iteration (1
    clock cycle). But it takes a lot more iterations than can be done
    with a more brainiac approach.

    Using a more brainiac approach would likely cut the performance in
    half and use a lot more LUTs.


    According to my experiments, combining 2 steps has a very small
    negative effect on achievable clock and increases area by ~1.8x. So,
    to me it looks like a win.
    That's the combined step I am talking about:
    x = n/4;
    switch (n % 4) {
    case 0: n = x; break;
    case 1: n = 3*x+1; break;
    case 2: n = 3*x+2; break;
    case 3: n = 9*x+8; break;
    }



    On the Xilinx side, it would not fit in any Spartan or in most
    Artix devices. Maybe it would fit in the biggest Artix
    UltraScale+ (AU25P). But more realistically for 400 test units one
    would want Arria 10 on Altera side or Kintex on Xilinx side. Both
    are not prohibitively expensive, but not cheap.

    What device are you using?
    I am not using a particularly inexpensive device. I have a Kintex-325 with
    about 200k LUTs.


    According to Xilinx marketing materials XC7K325T has 326080 equivalents
    of traditional LEs.
    That explains why your design fits.

    Pay attention that while Kintex evaluation boards could be
    non-expensive, right now buying devices for actual production work is
    rather costly, especially in small quantities.
    The lowest price I see on the Digikey site is 1147 USD. That's a bad
    value relative to many other FPGAs.


    If it is smaller than Kintex KU5P then I'd strongly suspect that you
    didn't clean out all mistakes.



    There could be more mistakes for sure. But I am sure the lowest level
    core works. It seemed to in simulation. I made a slight mod to it
    since I tested it last, checking n < startn.
    Counts are not used so the logic would be stripped out, but it's good
    for verification.

    1 Core is:

    module Collatz_conjecture(rst, clk, startn, count, done);
    input rst;
    input clk;
    input [127:0] startn;
    output reg [127:0] count;
    output reg done;
    reg [127:0] n;

    always_ff @(posedge clk)
    if (rst) begin
    n <= startn;
    count <= 0;
    done <= FALSE;
    end
    else begin
    if (!done) begin
    if (~n[0])
    n <= n >> 2'd1;
    else
    n <= n+n+n+1;
    if (n < startn)
    done <= TRUE;
    count <= count + 1;
    end
    end

    endmodule



    Ok.
    1. Your core misses one critical part - an overflow check.
    Assuming an 80-bit initial value, the 128-bit register will overflow
    eventually. It would happen very rarely, but it would happen.
    The check for overflow does not have to be precise. A conservative
    approach, like looking at the 2 MS bits and one LS bit and signaling
    overflow not when it really happened but when it *could* happen, i.e.
    when MS bits > 0 and LS bit = '1', is good enough. But the check can't
    be omitted.

    2. I see that instead of calculating n = (n*3+1)/2 as one step you are
    doing it in two separate steps. I am not sure that it really makes
    your core smaller or faster. And it certainly increases the # of
    iterations by 1/3rd.

    3.
    A 128-bit count is a pointless loss of resources. 12 or 13 bits are
    sufficient. If the count exceeded that size then you already found
    something of high interest and it is the job of the supervisor to find
    out what exactly had happened.
    The 128-bit counter is actually not just too big, it is also pointless,
    because if we pretend to believe that our purpose is to disprove
    Collatz's conjecture then the supervisor would have to implement a stop
    condition with some sort of timeout that would effectively duplicate
    the functionality of the counter.

    4.
    Your module does not latch the value of startn. It means that the caller
    module has to do it on its behalf. That is the most likely place where
    one can make a mistake.

    5.
    Coding style comment. Normally, the name rst is used for an asynchronous
    reset signal. In your design rst is a synchronous signal. For the
    benefit of the casual reader I suggest finding a different name.


    Re-wrote the core using Michael's suggested combined iteration. It looks
    something like the code below.
    Renamed the rst signal to ld to better reflect its usage.
    Timing is close to the same, but not as many cores can fit. It is a
    trade-off: fewer cores, but they process more iterations in one go. The
    net result is more numbers processed in the same length of time.
    I am using 16x16 cores now. Overflow check using 130-bit numbers.

    import const_pkg::*;

    module Collatz_conjecture(clk, ld, startn, done, overflow);
    input clk;
    input ld;
    input [127:0] startn;
    output reg done;
    output reg overflow;

    reg [129:0] n;

    always_ff @(posedge clk)
    if (ld)
    n <= 130'(startn);
    else begin
    if (!done)
    case(n[1:0])
    2'd0: n <= n >> 2'd2;
    2'd1: n <= n[129:2]+n[129:2]+n[129:2]+2'd1;
    2'd2: n <= n[129:2]+n[129:2]+n[129:2]+2'd2;
    2'd3: n <= {n[129:2],3'd0}+n[129:2]+4'd8;
    endcase
    end

    always_ff @(posedge clk)
    if (ld)
    done <= FALSE;
    else begin
    if (!done) begin
    if (n < startn || (|n[129:128]))
    done <= TRUE;
    end
    end

    always_ff @(posedge clk)
    if (ld)
    overflow <= FALSE;
    else begin
    if (!done) begin
    if (|n[129:128])
    overflow <= TRUE;
    end
    end

    endmodule


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Feb 22 18:41:50 2026
    From Newsgroup: comp.arch

    On 2/19/2026 2:27 PM, Stefan Monnier wrote:
    I must say I find this whole thread baffling: when searching for
    a counter example and we're already in the order of 2^70, it seems clear
    that constant-factor improvements are completely irrelevant (even with
    large constants).

    I guess it all boils down to the "useless" part of the title?


    Yeah, basically.

    It is useless in the sense that no matter how big you go, an exception
    won't be found, rendering further testing essentially meaningless
    (it doesn't prove anything unless an exception is found, and an
    exception won't be found if none can exist).



    I did feel tempted to write a program to draw a graph...

    Well, then did so, and posted a partial graph on X: https://x.com/cr88192/status/2025698159519817849

    Though, a linear/log mismatch makes it look wonky, and I got annoyed
    fiddling with trying to make the code that draws the graph not make it
    look like a mess.

    It is like, now realizing how it works and the fairly obvious nature of
    it being true, but also realizing that I am not in a position to be
    able to prove it as such, is making me feel stupid.

    But, alas, there are reasons I am not really a math or physics person...



    Well, at least excluding the guys who go around promoting wacky/crazy
    ideas like recursive compression and perpetual motion machines and
    similar (or, all the "quantum consciousness" stuff and similar),
    wouldn't really want to be associated with this crowd either...

    Actually, this latter camp I don't get:
    If you can easily test things and see that they don't work, what is the
    point of continuing to promote them as-if they do?...



    Or, like the whole fusion thing:
    Can someone make it happen? Yes.
    Can one get net positive energy output? Not so much.

    Or, the seeming futility of trying to send humans on interstellar trips. Basically, no way to send full sized humans interstellar within any
    plausible timescale or resource/energy budget. That said, not entirely convinced it is outside the reach of machines (so, my sci-fi stuff
    wouldn't involve sending humans, rather machines with enough to try to bootstrap the process).

    Ship can get to target system, find some asteroids or similar to begin consuming, building itself up into a more significant structure, then
    begins building other machinery it needs; building whatever it needs to initiate the colonization process. Even sending a full-set of organic
    samples is a bit of a stretch, so more likely it would preferentially bootstrap high priority species from actual cell lines, but may need
    backup strategies (recreating the whole biosphere from yeast or similar
    if need-be). In some sense, the main "payload" not likely being lots of organic matter, but enough dense information storage to be able to
    initiate the bootstrap process (and a machine just smart enough to be
    able to figure out how to find suitable asteroids and start the process).

    Galactic civilizations? No. Any such colonies being unlikely to realize
    the others exist, much less have any real way to communicate with them (light-speed limits effectively making any sort of large scale
    "civilization" all but impossible).

    Etc.



    Well, except in cases where I am feeling half tempted to do stories that
    just throw realism entirely out the window, "yeah, errm, how about we
    have wooden pirate ships, in space, fighting giant robots?..." maybe
    take the "solar wind" part a little too literally: have air in space and
    the ships just sorta tack-sailing on the wind ("moment of inertia be
    damned", things still move like they were floating on water, but in
    space). Stuff basically works like in 1980s cartoons (like, you know,
    the ones that mostly existed as glorified toy commercials). Then the
    worlds' physics are based less on realism and more a mix of "rule of
    cool" and "rule of funny".

    Say, also:
    Ships battle off giant robots by shooting at them with old style canons;
    People can expand to 50ft tall to fist-fight said robots;
    Rocket flatulence (can fly around by essentially farting fire);
    Trinket based costume changes;
    Giant robots vs dinosaurs (maybe also in space, dinosaurs also having
    rocket flatulence, and fire breath, ...);
    Obvious plot armor;
    People can escape restraints by burning them off with eye lasers;
    ...

    But, say:
    Characters are not just going around like Superman or something (because
    then there are no stakes), and thus only able to use their
    higher-powered abilities when the plot demands it.

    Did recently throw a fragment of this sort of idea at Grok Imagine, but
    while it didn't give exactly what I was imagining, it was in the same
    general area:
    https://x.com/cr88192/status/2025021520507011292

    But, alas in this case the video generation only gives a few seconds
    (for much more than this, would cost money...).


    Because, alas, reality is just kinda depressing sometimes...


    Well, except a while back, when I did write a story set in a sort of reverse-urban-fantasy setting:
    As opposed to the typical "fantasy stuff in an urban setting";
    More "stereotypical urban stuff in a fantasy setting".

    So, more rap-battling gangsta elves and stuff...


    Then again, "pirate ships in space" wouldn't exactly be too far out of
    place in a fantasy world. Could pass it off as "sci-fi" in a loose sense
    in that while it is fairly obviously a fantasy setting, many of the
    characters are not themselves aware that there is anything unusual with
    them living within a fantasy-world magic system. But, then may as well
    borrow ideas from "Rainbow Brite" or something, and have the ships
    powered by "Star Sprinkles" or such (well, and puts a limit on how much characters can do: The more powerful magic they use, the more rapidly
    they use up their supply of star sprinkles...).

    ...



    But, Alas...


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Feb 22 19:47:32 2026
    From Newsgroup: comp.arch

    On 2026-02-22 7:12 p.m., Robert Finch wrote:
    On 2026-02-21 1:36 p.m., Michael S wrote:
    On Sat, 21 Feb 2026 01:56:03 -0500
    Robert Finch <robfi680@gmail.com> wrote:

    On 2026-02-19 5:10 a.m., Michael S wrote:
    On Wed, 18 Feb 2026 22:56:19 -0500
    Robert Finch <robfi680@gmail.com> wrote:
    On 2026-02-18 11:22 a.m., Michael S wrote:
    On Sun, 15 Feb 2026 10:56:51 -0500
    Robert Finch <robfi680@gmail.com> wrote:
    I have coded a version that tests up to 8192 number
    simultaneously. It is supposed to be able to run at over 200
    MHz. Some of the number would take a large number of iterations
    though.

    I found something strange, the code to count for more numbers
    does not increase the size of the core very much. (But it takes
    longer to build.) I am not sure if this occurs because there is
    a mistake somewhere, or if the tools are able to figure out how
    to share all the logic. I am thinking the logic for adders is
    shared somehow. 128-bit arithmetic is being used.

    There is a mistake in your code. No other explanation is possible.
    You should look for 3 sorts of mistakes
    1. Core itself.
    The most likely mistake is failing to look for one of the ending
    conditions. You have to look for 3 conditions:
    a) - value produced at current step is smaller than initial value.
    That is successful ending. Here you signal to the scheduler
    that you are ready to accept the next value.
    b) - overflow. Next value exceeds the width of register.
    c) - timeout. You were running for more than N steps. N=8192 looks
    like the reasonable value.
    b) and c) are suspected failures.
    The handling for b) and c) is similar - you raise an alert signal and
    stop. Your test unit has to provide supervisor circuitry with the
    means to fetch the initial value that caused the suspected failure and
    After alert is cleared your unit returns to the same state as
    after success - waiting for next initial value.

    I found the mistake. Core was working, but groups of cores were
    not. Done signals were missing.

    2. Scheduler.
    In proof-of-concept design the scheduler does not have to be very
    realistic. For example, it does not have to be efficiently
    pipelined. But it should be able to monitor ready signals of all
    test units and to supply different initial data to different
    units. Supplying the same data to all units simultaneously is a
    sort of mistake that explains your observed behavior.
    There is really simple scheduling ATM. There is just a bunch of
    done signals that are gated together. Meaning the slowest
    calculation of the group will slow everything else down. Once all
    the dones are true the next set of numbers are loaded.
    3. Supervisor
    Again, in proof-of-concept design the supervisor does not have to
    be very realistic. But it should be able to selectively exercise
    all core feature related to alert, including fetching of initial
    data and clearing of alert.

    After you coded everything correctly, do not even try to compile
    test with 8000 units. That is totally unrealistic. Even if you
    find huge FPGA in which such number of testers could fit (and I
    am not sure that such FPGA exists) the compilation will take many
    hours, possibly days.


    Yeah 8,000 was definitely out of the picture. I had since coded
    24x24 matrix (576) and although the logic would fit that turned
    out not to be able to route. So, it is down to 20x20. 20x20 with
    no real scheduling fits.

    I tried displaying progress as an on-screen bitmap, but that just
    appeared noisy. I have been working on an SCP so may be able to use
    that to manage the scheduling with some custom logic.



    Does 20x20 mean 400 cores with something like 80-bit initial values
    and 128-bit processing?
    Yes. 128-bit processing.

    If you didn't make further mistakes then it should be rather big,
    order of 300K LEs. It would not fit in the biggest low-end Altera
    device (Cyclone 10GX 10CX220).

    IIRC it is about 150k LUTs. It uses the simplest approach, which is
    not very fast iteration wise, but it allows a lot of cores and a fast
    clock cycle time. One core is very small and has a fast iteration (1
    clock cycle). But it takes a lot more iterations than can be done
    with a more brainiac approach.

    Using a more brainiac approach would likely cut the performance in
    half and use a lot more LUTs.


    According to my experiments, combining 2 steps has a very small negative
    effect on achievable clock and increases area by ~1.8x. So, to me it
    looks like a win.
    That's the combined step I am talking about:
      x = n/4;
      switch (n % 4) {
        case 0: n = x; break;
        case 1: n = 3*x+1; break;
        case 2: n = 3*x+2; break;
        case 3: n = 9*x+8; break;
      }



    On the Xilinx side, it would not fit in any Spartan or in most
    Artix devices. Maybe it would fit in the biggest Artix
    UltraScale+ (AU25P). But more realistically for 400 test units one
    would want Arria 10 on Altera side or Kintex on Xilinx side. Both
    are not prohibitively expensive, but not cheap.

    What device are you using?
    I am not using a particularly inexpensive device. I have a Kintex-325 with
    about 200k LUTs.


    According to Xilinx marketing materials XC7K325T has 326080 equivalents
    of traditional LEs.
    That explains why your design fits.

    Pay attention that while Kintex evaluation boards could be
    non-expensive, right now buying devices for actual production work is
    rather costly, especially in small quantities.
    The lowest price I see on the Digikey site is 1147 USD. That's a bad
    value relative to many other FPGAs.


    If it is smaller than Kintex KU5P then I'd strongly suspect that you
    didn't clean out all mistakes.


    There could be more mistakes for sure. But I am sure the lowest level
    core works. It seemed to in simulation. I made a slight mod to it
    since I tested it last, checking n < startn.
    Counts are not used so the logic would be stripped out, but it's good
    for verification.

    1 Core is:

    module Collatz_conjecture(rst, clk, startn, count, done);
    input rst;
    input clk;
    input [127:0] startn;
    output reg [127:0] count;
    output reg done;
    reg [127:0] n;

    always_ff @(posedge clk)
    if (rst) begin
        n <= startn;
        count <= 0;
        done <= FALSE;
    end
    else begin
        if (!done) begin
            if (~n[0])
                n <= n >> 2'd1;
            else
                n <= n+n+n+1;
            if (n < startn)
                done <= TRUE;
            count <= count + 1;
        end
    end

    endmodule



    Ok.
    1. Your core misses one critical part - an overflow check.
    Assuming an 80-bit initial value, the 128-bit register will overflow
    eventually. It would happen very rarely, but it would happen.
    The check for overflow does not have to be precise. A conservative
    approach, like looking at the 2 MS bits and one LS bit and signaling
    overflow not when it really happened but when it *could* happen, i.e.
    when MS bits > 0 and LS bit = '1', is good enough. But the check can't
    be omitted.

    2. I see that instead of calculating n = (n*3+1)/2 as one step you are
    doing it in two separate steps. I am not sure that it really makes
    your core smaller or faster. And it certainly increases the # of
    iterations by 1/3rd.

    3.
    A 128-bit count is a pointless loss of resources. 12 or 13 bits are
    sufficient. If the count exceeded that size then you already found
    something of high interest and it is the job of the supervisor to find
    out what exactly had happened.
    The 128-bit counter is actually not just too big, it is also pointless,
    because if we pretend to believe that our purpose is to disprove
    Collatz's conjecture then the supervisor would have to implement a stop
    condition with some sort of timeout that would effectively duplicate
    the functionality of the counter.

    4.
    Your module does not latch the value of startn. It means that the caller
    module has to do it on its behalf. That is the most likely place where
    one can make a mistake.

    5.
    Coding style comment. Normally, the name rst is used for an asynchronous
    reset signal. In your design rst is a synchronous signal. For the
    benefit of the casual reader I suggest finding a different name.


    Re-wrote the core using Michael's suggested combined iteration. It looks
    something like the code below.
    Renamed the rst signal to ld to better reflect its usage.
    Timing is close to the same, but not as many cores can fit. It is a
    trade-off: fewer cores, but they process more iterations in one go. The
    net result is more numbers processed in the same length of time.
    I am using 16x16 cores now. Overflow check using 130-bit numbers.

    import const_pkg::*;

    module Collatz_conjecture(clk, ld, startn, done, overflow);
    input clk;
    input ld;
    input [127:0] startn;
    output reg done;
    output reg overflow;

    reg [129:0] n;

    always_ff @(posedge clk)
    if (ld)
        n <= 130'(startn);
    else begin
        if (!done)
            case(n[1:0])
            2'd0:   n <= n >> 2'd2;
            2'd1:   n <= n[129:2]+n[129:2]+n[129:2]+2'd1;
            2'd2:   n <= n[129:2]+n[129:2]+n[129:2]+2'd2;
            2'd3:   n <= {n[129:2],3'd0}+n[129:2]+4'd8;
            endcase
    end

    always_ff @(posedge clk)
    if (ld)
        done <= FALSE;
    else begin
        if (!done) begin
            if (n < startn || (|n[129:128]))
                done <= TRUE;
        end
    end

    always_ff @(posedge clk)
    if (ld)
        overflow <= FALSE;
    else begin
        if (!done) begin
            if (|n[129:128])
                overflow <= TRUE;
        end
    end

    endmodule


    Just noticed overflow needs a couple of extra bits, since it is multiplying by 9.
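As a sanity check on the case arms, here is a quick Python sketch (mine, not part of the module) that re-derives what each arm of case(n[1:0]) should produce and checks it against plain Collatz steps. Note that for n = 4k+2 the halve / odd-step / halve sequence lands on 3k+2:

```python
def collatz_step(n):
    # one plain Collatz step
    return 3 * n + 1 if n & 1 else n // 2

def two_bit_step(n):
    # what each arm of case(n[1:0]) should compute, with k = n >> 2:
    #   n=4k:   two halvings      -> k
    #   n=4k+1: (3n+1)/4          -> 3k+1
    #   n=4k+2: (3(n/2)+1)/2      -> 3k+2
    #   n=4k+3: (3((3n+1)/2)+1)/2 -> 9k+8
    k = n >> 2
    return (k, 3 * k + 1, 3 * k + 2, 9 * k + 8)[n & 3]

# every arm result must appear on the plain Collatz trajectory of n
for n in range(4, 5000):
    m, target = n, two_bit_step(n)
    for _ in range(8):
        m = collatz_step(m)
        if m == target:
            break
    assert m == target, n
```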

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Mon Feb 23 10:51:12 2026
    From Newsgroup: comp.arch

    On 22/02/2026 19:38, BGB wrote:

    I was initially seeing it more like a very large hash table.
    But, I have since realized it is not a hash table, it is a more
    organized pattern.

    As far as we know, the result of applying the Collatz function
    (calculating the number of steps to reach 1, if it ever does) follows a
    statistical pattern, but no clear pattern for individual values. There
    are a lot of mathematical functions like that - a good example being the
    number of prime factors in a number. You can draw graphs and visually
    see nice patterns - some of which can be proven, while others are still
    unproven and may not hold at a big enough scale. Sometimes there are
    tricks or hints that improve your chances when guessing which values
    might be outliers.

    There /is/ a pattern here, in that the Collatz function is fully
    deterministic and thus generates the pattern. But it is not clear if
    there is any other way to determine the individual values.


    But, yeah, not a math person, and I will make no claim of having
    "proven" anything...

    Rather, my aim was mostly to try to start organizing stuff enough to
    where I could get a sense of why it works, and doesn't explode off into infinite numbers.


    If you can explain that, that would be a huge breakthrough! The maths
    showing the average growth rates is clear enough (you already showed
    it), and it has been formalised to show that for most numbers, there are limits on their growth. But that doesn't say anything about whether or
    not there are some numbers for which the Collatz function diverges, or
    some numbers for which they form a cycle other than the 4, 2, 1 cycle. (Remember, there are two possible ways for numbers to fail the Collatz Conjecture - not just diverging ones.)


    But, will note that this whole area is one major reason I am not really
    into math or physics. As a practical tool, they are useful, or can be
    seen as interesting detours.

    But, proofs kinda ruin the whole experience.


    I don't think any mathematicians see the Collatz Conjecture as a
    practical tool! But sometimes practical tools come out of working on
    such problems. And often it is the proofs involved in such maths that
    are the attractive aspects. (Proofs along the way, gradually pushing
    the limits of parts of a problem, are typically less interesting to the layman.)

    You can get a lot of appreciation for these kinds of problems by doing
    some of the calculations yourself. This thread has shown that it can be
    a fun coding challenge (for cpus, gpus, fpgas, or whatever). But you
    can also draw some graphs and see how the function behaves. And there
    are many ways to extend the idea to other Collatz-like functions, or
    other domains than just the positive integers.
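    For anyone wanting to try the "do the calculations yourself" suggestion, here is a minimal Python sketch (the names are mine; plot the resulting list with whatever tool you like):

```python
def stopping_time(n, limit=10000):
    """Number of Collatz steps from n down to 1 (None if limit exceeded)."""
    steps = 0
    while n != 1:
        n = 3 * n + 1 if n & 1 else n // 2
        steps += 1
        if steps > limit:
            return None
    return steps

# stopping times for the first few hundred integers; graphing these shows
# the statistical pattern, plus outliers such as n = 27 (111 steps)
times = [stopping_time(n) for n in range(1, 500)]
```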


    But, yeah, at least for me, now seeing it as a sort of cone of
    power-of-2 rings with all paths forming into spirals which end at 1,
    each odd step effectively going half-way around a spiral formed from the
    number line but then ending up at an incremented position, etc, seems
    like a way to see why it does what it does.

    Even if it doesn't count for anything, it isn't much worse than trying
    to test it empirically, which as noted, also does not count (it could
    only count if an exception could be found, but an exception won't be
    found no matter how big the search space becomes, so...).

    But, no proofs to be found here, I will not make this claim, in any case.


    Fair enough.

    If you are interested in the problem and want to see some videos about
    it, I can recommend:

    Numberphile (good fun maths channel, without requiring much knowledge) <https://www.youtube.com/watch?v=5mFpVDpKX70>

    Veritasium (great science and fact channel) <https://www.youtube.com/watch?v=094y1Z2wpJg>


    Other fun maths proof videos to make you appreciate proofs more!

    Vsauce (science and maths channel) <https://www.youtube.com/watch?v=s86-Z-CbaHA>

    3Blue1Brown (science and maths channel) <https://www.youtube.com/watch?v=yuVqxCSsE7c>

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Mon Feb 23 10:53:30 2026
    From Newsgroup: comp.arch

    On 22/02/2026 17:41, Stefan Monnier wrote:
    It is a mathematically very interesting problem. Whether there is
    anything to be gained by continuing the search for exceptions by brute
    force is a different matter - but /if/ it is worth doing, then doing so
    more efficiently is a good idea.

    AFAIK, in such games "more efficiently" tends to be largely irrelevant
    if it's more efficient by a (non-astronomical) constant factor.
    Instead, you need it to be algorithmically more efficient.


    Sure - algorithmic improvements usually (but not always in practice)
    lead to the biggest efficiency gains. But if you are doing a "test
    every number up to N" check of a conjecture, /any/ efficiency
    improvements are helpful.


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Feb 23 11:10:01 2026
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> schrieb:
    On 2/19/2026 2:27 PM, Stefan Monnier wrote:
    I must say I find this whole thread baffling: when searching for
    a counter example and we're already in the order of 2^70, it seems clear
    that constant-factor improvements are completely irrelevant (even with
    large constants).

    I guess it all boils down to the "useless" part of the title?


    Yeah, basically.

    It is useless in the sense that no matter how big you go, an exception
    won't be found, rendering further testing essentially meaningless (it
    doesn't prove anything unless an exception is found, and an exception
    won't be found if none can exist).

    You are assuming that the Collatz conjecture is true. But this
    is unknown, and that is the whole point of the exercise.

    Do I personally think that exhaustive search will find something? No.

    Does my personal belief matter in the slightest? No.

    [...]

    Or, the seeming futility of trying to send humans on interstellar trips. Basically, no way to send full sized humans interstellar within any plausible timescale or resource/energy budget.

    My favorite candidate for that is the nuclear salt water rocket.
    Which will be difficult to build, or so my son (who just received
    his master's degree in nuclear engineering) tells me because
    it needs to be prompt critical.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Mon Feb 23 10:52:15 2026
    From Newsgroup: comp.arch

    David Brown [2026-02-23 10:53:30] wrote:
    Sure - algorithmic improvements usually (but not always in practice)
    lead to the biggest efficiency gains. But if you are doing a "test
    every number up to N" check of a conjecture, /any/ efficiency
    improvements are helpful.

    But once you get to 2^70, the exponential explosion means that
    constant-factor improvements are vanishingly helpful.


    === Stefan
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Mon Feb 23 17:24:49 2026
    From Newsgroup: comp.arch

    On 2/23/2026 5:10 AM, Thomas Koenig wrote:
    BGB <cr88192@gmail.com> schrieb:
    On 2/19/2026 2:27 PM, Stefan Monnier wrote:
    I must say I find this whole thread baffling: when searching for
    a counter example and we're already in the order of 2^70, it seems clear
    that constant-factor improvements are completely irrelevant (even with
    large constants).

    I guess it all boils down to the "useless" part of the title?


    Yeah, basically.

    It is useless in the sense that no matter how big you go, an exception
    won't be found, rendering further testing essentially meaningless (it
    doesn't prove anything unless an exception is found, and an exception
    won't be found if none can exist).

    You are assuming that the Collatz conjecture is true. But this
    is unknown, and that is the whole point of the exercise.

    Do I personally think that exhaustive search will find something? No.

    Does my personal belief matter in the slightest? No.



    My own looking into this led me to realize that the number-space is
    modulo (relative to a cone with a spiral that repeats with every
    power-of-2), and that all the odd steps only ever increment the phase
    angle by 1, or fall down to a lower level (whenever the phase angle is
    not incremented).

    In effect, each odd step jumps to the opposite side of the cone, but
    slightly increments the angle. Since the phase-angle path can never move backwards, it can't touch any previous paths, thus, no loops can happen
    this way.


    The only way a cycle could exist is if (somehow) there were an
    impassable wall where pretty much *no* value could descend unless it
    hits the power-of-2 descent path, and then somehow miss this path as well.

    But, since "3*n+1" results in the minimum forward step, this means that
    one of the sides *has* to hit the power-of-2 descent cases at some point.

    Even if you did get very unlucky and hit odd-paths at nearly every step,
    it would eventually hit the power-of-2 escape case which is basically an immediate descent down to 1.

    Or, as I see it, the geometry of the number space itself prevents the possibility of a loop.


    Like, while I can't write a proof for it, I can still see that it is
    true in a geometric/mechanistic sense.

    Like, my confidence that it is true isn't so much because of a "no
    exception has been found, so I believe it is true" sense, but more
    because I can see the geometry of the number space well enough to see
    that no non-descending paths can exist through the space.


    [...]

    Or, the seeming futility of trying to send humans on interstellar trips.
    Basically, no way to send full sized humans interstellar within any
    plausible timescale or resource/energy budget.

    My favorite candidate for that is the nuclear salt water rocket.
    Which will be difficult to build, or so my son (who just received
    his master's degree in nuclear engineering) tells me because
    it needs to be prompt critical.


    Yeah, I have floated a few ideas...


    One I used in my own sci-fi was using cyclotrons and a long linac
    (linear accelerator) stage to try to push atomic nuclei to relativistic
    speeds (and generating thrust mostly from the energy required to
    accelerate the nuclei).

    Comparably, thrust would be modest, but the amount of propellant used
    could be very small. Harder part is powering the thing...

    Main candidates are nuclear and fusion power. With fusion, one could
    separate out the He nuclei and use these. With nuclear, CO2 makes more
    sense as a working fluid and propellant (after first decomposing it into
    C and O nuclei).

    While fusion would be the preferable option, nuclear would be easier to
    build in practice (probably a molten-salt breeder reactor with CO2 as
    the working fluid).

    No light water reactors though, these would just suck...


    For high-speed travel, would likely need to have a lot of ablative
    shielding. My current leaning here basically being to put a very big
    chunk of graphite on the front of the ship.

    Could use magnetic deflection, but this is likely to result in more
    energy cost and drag than essentially using a big chunk of graphite and
    maybe some permanent magnets (to slightly reduce ablation rate). Likely
    if using permanent magnets, there would be an iron structure inside the
    big chunk of graphite.

    Say, this to deal with the front of the ship being hit with a stream of hydrogen atoms potentially traveling at relativistic speeds, etc.

    Would likely be pointy (sort of like the tip of a pencil) to reduce drag.

    It is possible that the iron structure (inside the graphite) could
    in part be magnetized using part of the field strength of the magnets in
    the cyclotrons (say, for example, using big neodymium magnets or similar
    to limit energy cost).




    Comparably, variations of nuclear thermal rockets have tradeoffs:
    Orders of magnitude more thrust;
    But, lower maximum exhaust velocity (so, more propellant is needed);
    ...

    So, these are more likely to be able to get going good and fast, within
    a more reasonable timeframe, but the problem will then become how far
    you can get before running out of fuel. Assuming human passengers
    though, time and human lifespan become major limiting factors.

    Also, for humans, you need a whole lot of stuff to keep them alive and comfortable, etc. So, you need a whole lot of thrust to get things
    moving quickly.

    So, in this sense, nuclear thermal makes more sense.



    Vs, say, trying to design the engine around the assumption of a
    "continuous burn" likely measured in centuries (for a ship with a
    comparably modest payload).

    Ion engines are also possible, but downside of ion engines is that they
    would have higher propellant use rate than a cyclotron+linac while also
    giving significantly less thrust relative to the amount of propellant used.


    Downside of cyclotron+linac though would be that it would be more energy hungry and necessarily significantly larger/bulkier. Such a ship would
    be impractical for ground launch so would need to be assembled in space.

    Well, unless a very small version were built, likely with the
    (comparably long) linac tail in a folded up form, which then unpacks and assembles itself once in space.

    Pretty much no one else that I am aware of is pushing for a
    cyclotron+linac based propulsion system though...


    But, then again:
    Less viability for manned interplanetary travel (nuclear powered plasma thrusters make more sense for this);
    Less viability for unmanned interplanetary travel (domain likely better
    suited to ion engines);
    Not viable for manned interstellar travel (limited payload, using it for
    a manned mission would likely result in everyone expiring long before it reached the destination).

    So, it is mostly (at best) left as an option for unmanned interstellar
    travel.


    Well, there are other ideas around, like using very small probes, and accelerating them using huge lasers, but downside is that these would
    have no way to decelerate.

    With other strategies, one basically needing to flip the ship around and spending the latter half of the trip decelerating (say, with the "butt"
    end of the ship having the ablative graphite shield in a more concave
    shape in an attempt to maximize the drag profile).




    One traditional idea has also been "generation ships", but the
    practicality of this is debatable IMO.

    Wrote a bit on this, but decided to leave this out.

    Seems like whatever approach to generation ships is taken, it still
    needs to be able to reach the target within a few centuries to have much
    of any real hope of success (1k years, very low probability; after 10k
    years, "inescapable problems" would make longer-term survival almost impossible, *1).

    *1: Running out of usable fissile materials; running out of nitrogen
    (would need to initially stock up on nitrogen in the form of C-14, which
    could decay to replace that lost over time, but by 10k years, much of
    the C14 would be gone; and long term survival is kinda gone if nitrogen
    runs out), ... Stocking up on more thorium and C-14 and similar would ultimately only delay the issues.


    Granted, one could reach further out via "system hopping", say, if each
    colony can spot another system with potentially habitable target worlds,
    say, within 25ly .. 50ly or similar, could spread pretty far over time.

    Still none of the "galactic civilization" stuff (as I see it, the
    formation of a cohesive civilization spanning multiple star systems is
    most likely essentially impossible; absent the development of FTL travel
    and communication, etc, which is itself most likely impossible).

    ...


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Tue Feb 24 06:44:31 2026
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> schrieb:
    My own looking into this led me to realize that the number-space is
    modulo (relative to a cone with a spiral that repeats with every power-of-2), and that all the odd steps only ever increment the phase
    angle by 1, or fall down to a lower level (whenever the phase angle is
    not incremented).

    In effect, each odd step jumps to the opposite side of the cone, but slightly increments the angle. Since the phase-angle path can never move backwards, it can't touch any previous paths, thus, no loops can happen
    this way.


    The only way a cycle could exist is if (somehow) there were an
    impassable wall where pretty much *no* value could descend unless it
    hits the power-of-2 descent path, and then somehow miss this path as well.

    But, since "3*n+1" results in the minimum forward step, this means that
    one of the sides *has* to hit the power-of-2 descent cases at some point.

    The same argument would hold for the 3n-1 problem, where there is
    more than one cycle.
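    (A quick empirical illustration of this, in Python - my sketch, not Thomas's. Under 3n-1 the known cycles have minima 1, 5, and 17, and small starting values all fall into one of them:)

```python
def step(n):
    # the "3n-1" variant: 3n-1 on odd, n/2 on even
    return 3 * n - 1 if n & 1 else n // 2

def cycle_min(n):
    # iterate until some value repeats, then report the smallest
    # member of the cycle that was entered
    seen = []
    while n not in seen:
        seen.append(n)
        n = step(n)
    return min(seen[seen.index(n):])

# all starts below 200 land in one of the three known cycles
mins = {cycle_min(n) for n in range(1, 200)}
```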

    [...]
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Feb 24 01:38:05 2026
    From Newsgroup: comp.arch

    On 2/24/2026 12:44 AM, Thomas Koenig wrote:
    BGB <cr88192@gmail.com> schrieb:
    My own looking into this led me to realize that the number-space is
    modulo (relative to a cone with a spiral that repeats with every
    power-of-2), and that all the odd steps only ever increment the phase
    angle by 1, or fall down to a lower level (whenever the phase angle is
    not incremented).

    In effect, each odd step jumps to the opposite side of the cone, but
    slightly increments the angle. Since the phase-angle path can never move
    backwards, it can't touch any previous paths, thus, no loops can happen
    this way.


    The only way a cycle could exist is if (somehow) there were an
    impassable wall where pretty much *no* value could descend unless it
    hits the power-of-2 descent path, and then somehow miss this path as well.
    But, since "3*n+1" results in the minimum forward step, this means that
    one of the sides *has* to hit the power-of-2 descent cases at some point.

    The same argument would hold for the 3n-1 problem, where there is
    more than one cycle.


    At least going by "the power of imagination" (would require testing to confirm):

    "3*n-1" would be different in that it neither reliably advances the
    phase angle, nor reliably moves it backwards. So, one is likely to end
    up with pairs of odd-steps that loop back to the same position, creating infinite cycles (so, it is not the same geometry).


    "3*n-3" might be stable (may appear to behave similarly to 3*n+1), but
    would instead advance the phase angle in the opposite direction.

    "3*n+3" should initially appear similar to "3*n+1", but would have paths
    that ascend to infinity (paths would exist which manage to miss all of
    the even steps and also miss the magic "power-of-2 auto-win" step).

    If the adjustment offset is even, then the "(3*n+1)/2" trick breaks, and
    also all paths that hit an odd number will ascend to infinity.
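    The even-offset case is easy to check directly: 3*odd + even is odd again, so once the sequence hits an odd number the halving rule never fires and the value climbs forever. A tiny Python illustration (the offset and start value here are arbitrary picks):

```python
def step(n, c):
    # generalized map: 3n+c on odd, n/2 on even
    return 3 * n + c if n & 1 else n // 2

n, c = 7, 2          # any odd start, any even offset
trail = [n]
for _ in range(20):
    n = step(n, c)
    trail.append(n)
# trail is all-odd and strictly increasing: the path ascends to infinity
```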


    Granted, may make sense to check/confirm, if these don't match up I
    might need to revisit my mental model.

    ...


    [...]


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Feb 24 03:16:04 2026
    From Newsgroup: comp.arch

    On 2/24/2026 1:38 AM, BGB wrote:
    On 2/24/2026 12:44 AM, Thomas Koenig wrote:
    BGB <cr88192@gmail.com> schrieb:
    My own looking into this led me to realize that the number-space is
    modulo (relative to a cone with a spiral that repeats with every
    power-of-2), and that all the odd steps only ever increment the phase
    angle by 1, or fall down to a lower level (whenever the phase angle is
    not incremented).

    In effect, each odd step jumps to the opposite side of the cone, but
    slightly increments the angle. Since the phase-angle path can never move
    backwards, it can't touch any previous paths, thus, no loops can happen
    this way.


    The only way a cycle could exist is if (somehow) there were an
    impassable wall where pretty much *no* value could descend unless it
    hits the power-of-2 descent path, and then somehow miss this path as
    well.

    But, since "3*n+1" results in the minimum forward step, this means that
    one of the sides *has* to hit the power-of-2 descent cases at some
    point.

    The same argument would hold for the 3n-1 problem, where there is
    more than one cycle.




    Then, went and took a shower, realized there was more to this:

    At least going by "the power of imagination" (would require testing to confirm):

    "3*n-1" would be different in that it neither reliably advances the
    phase angle, nor reliably moves it backwards. So, one is likely to end
    up with pairs of odd-steps that loop back to the same position, creating infinite cycles (so, it is not the same geometry).


    Imagination is mostly saying there will be a lot of fairly short loops here.

    There should probably be a quite significant number of short loops in
    this case.



    "3*n-3" might be stable (may appear to behave similarly to 3*n+1), but
    would instead advance the phase angle in the opposite direction.


    But, "3*n-3" will change the ending state, so it will not land on 1 anymore.

    Rather:
    Around 84% of the time it will land on 3;
    Around 16% of the time it will land on 0.
    Give or take... ("imagination" showing around 84% or so of paths landing
    on 3).

    Will probably still converge, just ending up at a different end state.


    All other negative offset paths would lead to underflow.

    So, can probably exclude cases where the value space goes negative.
    Though... "looks like" if these were included there is another cone on
    the other side, if one goes this way.



    "3*n+3" should initially appear similar to "3*n+1", but would have paths that ascend to infinity (paths would exist which manage to miss all of
    the even steps and also miss the magic "power-of-2 auto-win" step).

    If the adjustment offset is even, then the "(3*n+1)/2" trick breaks, and also all paths that hit an odd number will ascend to infinity.


    Further assertion 1:
    All other positive offset paths should contain paths to infinity (along
    with jostling around the ending states for those paths which do converge).

    So, the specific end-state behavior seems unique.

    Further assertion 2:
    "3*n+1" may be unique in this space.

    Or, possibly the only one that reliably converges, lands on 1, and does
    not contain cycles.



    Granted, may make sense to check/confirm, if these don't match up I
    might need to revisit my mental model.


    Still TODO, but for me, shower seemed like a bigger immediate priority.


    So, yeah, I guess I may report back later as to whether or not the
    actual math behaves the same way as it does in my imagination.

    ...


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Feb 24 04:52:14 2026
    From Newsgroup: comp.arch

    On 2/24/2026 3:16 AM, BGB wrote:
    On 2/24/2026 1:38 AM, BGB wrote:
    On 2/24/2026 12:44 AM, Thomas Koenig wrote:
    BGB <cr88192@gmail.com> schrieb:
    My own looking into this led me to realize that the number-space is
    modulo (relative to a cone with a spiral that repeats with every
    power-of-2), and that all the odd steps only ever increment the phase
    angle by 1, or fall down to a lower level (whenever the phase angle is
    not incremented).

    In effect, each odd step jumps to the opposite side of the cone, but
    slightly increments the angle. Since the phase-angle path can never
    move
    backwards, it can't touch any previous paths, thus, no loops can happen
    this way.


    The only way a cycle could exist is if (somehow) there were an
    impassable wall where pretty much *no* value could descend unless it
    hits the power-of-2 descent path, and then somehow miss this path as
    well.

    But, since "3*n+1" results in the minimum forward step, this means that
    one of the sides *has* to hit the power-of-2 descent cases at some
    point.

    The same argument would hold for the 3n-1 problem, where there is
    more than one cycle.






    Now, after actually testing (to confirm or disprove my imagination):

    Then, went and took a shower, realized there was more to this:

    At least going by "the power of imagination" (would require testing to
    confirm):

    "3*n-1" would be different in that it neither reliably advances the
    phase angle, nor reliably moves it backwards. So, one is likely to end
    up with pairs of odd-steps that loop back to the same position,
    creating infinite cycles (so, it is not the same geometry).


    Imagination is mostly saying there will be a lot of fairly short loops
    here.

    There should probably be a quite significant number of short loops in
    this case.


    Quite a lot of loops, but they are not quite matching the expected
    pattern (the typical loop period is longer than imagination had implied).




    "3*n-3" might be stable (may appear to behave similarly to 3*n+1), but
    would instead advance the phase angle in the opposite direction.


    But, "3*n-3" will change the ending state, so it will not land on 1
    anymore.

    Rather:
    Around 84% of the time it will land on 3;
    Around 16% of the time it will land on 0.
    Give or take... ("imagination" showing around 84% or so of paths landing
    on 3).

    Will probably still converge, just ending up at a different end state.


    Disproven:
    "3*n-3" is nowhere near stable.

    It does land on 3, but is prone to loops (imagination did not predict prevalent looping for this one).



    All other negative offset paths would lead to underflow.

    So, can probably exclude cases where the value space goes negative.
    Though... "looks like" if these were included there is another cone on
    the other side, if one goes this way.


    Not testing for this ATM.



    "3*n+3" should initially appear similar to "3*n+1", but would have
    paths that ascend to infinity (paths would exist which manage to miss
    all of the even steps and also miss the magic "power-of-2 auto-win"
    step).


    Does appear similar, but with a different ending state (lands on 3).

    But, thus far the infinity-paths are a no-show.

    Thus far, otherwise, 3*n+3 appears stable; at least for the range
    tested. This part does match mental prediction so far (but appears more
    stable than expected).

    Mental model expected existence of divergent paths to exist in this
    case, but thus far I am not seeing any.



    If the adjustment offset is even, then the "(3*n+1)/2" trick breaks,
    and also all paths that hit an odd number will ascend to infinity.


    Confirmed: Even offsets do fire off towards infinity.



    Further assertion 1:
    All other positive offset paths should contain paths to infinity (along
    with jostling around the ending states for those paths which do converge).

    So, the specific end-state behavior seems unique.


    Non-match:
    Loops (or long cycles) are more prevalent than expected;
    Paths to infinity are not.

    Testing +5 and +7, seeing a whole lot of loops, not so many divergent paths.

    Thus far, divergent paths have only appeared in one of the expected
    scenarios.


    Granted, it is possible that what my imagination was predicting as
    divergent paths are instead manifesting as long-period cycles (probably because they are hitting too many even-numbered spots to achieve
    "sufficient escape velocity").

    May need to add something to try to differentiate loops from "failed
    escape paths" where it just skips around, but doesn't represent an
    actual loop.


    Further assertion 2:
    "3*n+1" may be unique in this space.

    Or, possibly the only one that reliably converges, lands on 1, and does
    not contain cycles.


    Thus far, it is looking like "3*n+1" is likely fairly unique here.




    Granted, may make sense to check/confirm, if these don't match up I
    might need to revisit my mental model.


    Still TODO, but for me, shower seemed like a bigger immediate priority.


    So, yeah, I guess I may report back later as to whether or not the
    actual math behaves the same way as it does in my imagination.


    Not quite as close of a match as I would like.


    So, seems my imagination model is still subject to doubt in this case...

    Given my seeming inability to predict the behavior of adjusting the
    offset to a sufficient degree of accuracy, it is possible my
    interpretation of the "3*n+1" case may not be sufficiently accurate.

    Nothing wildly different from my imagination happened, but it is still
    not quite close enough it would seem.

    ...


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Tue Feb 24 11:16:40 2026
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> schrieb:

    "3*n-1" would be different in that it neither reliably advances the
    phase angle, nor reliably moves it backwards. So, one is likely to end
    up with pairs of odd-steps that loop back to the same position, creating infinite cycles (so, it is not the same geometry).

    What is this phase angle? I am only familiar with that term for
    complex numbers.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Feb 24 13:19:59 2026
    From Newsgroup: comp.arch

    On 2/24/2026 5:16 AM, Thomas Koenig wrote:
    BGB <cr88192@gmail.com> schrieb:

    "3*n-1" would be different in that it neither reliably advances the
    phase angle, nor reliably moves it backwards. So, one is likely to end
    up with pairs of odd-steps that loop back to the same position, creating
    infinite cycles (so, it is not the same geometry).

    What is this phase angle? I am only familiar with that term for
    complex numbers.


    I had noted that it follows a pattern, but primarily if one splits it up
    into a spiral that itself resembles a floating point number.

    Though, splitting it into real and imaginary components could also make
    sense as well, but I was thinking of something more like polar
    coordinates though.


    So, say, level=exponent, phase=mantissa. It differs some in that, in
    this context, it is more continuous (like a log-scale number), but some
    of the behavioral patterns are more like those of a floating-point number.

    So, say:
    level ~= log2(n)
    phase ~= n/(2**floor(level))
    So, say, imagine it represents a value between 1.0 and 1.9999...

    Each crossing of a power of 2 is an immediate drop to the "win".
    Each odd step will either land on a spot that is equal (and drop a
    level) or increment the phase (roughly like incrementing the ULP in
    a floating point number, if the size of the mantissa were proportional
    to the value of the exponent).

    It seems, the behavior only manifests in the desired ways if:
    Relative phase increases for each odd step;
    Increases by the minimum possible amount
    (what results from the +1 adjustment).
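A minimal sketch of my reading of this level/phase model (the function names and the odd-steps-only traversal are my own, not from the post): level is floor(log2(n)), phase is n scaled into [1.0, 2.0), tracked after each odd step with trailing halvings stripped.

```python
# Sketch (assumptions: names and structure are illustrative, not BGB's code).
def level_phase(n):
    level = n.bit_length() - 1        # floor(log2(n))
    phase = n / (1 << level)          # in [1.0, 2.0), the "mantissa"
    return level, phase

def odd_steps(n, limit=20):
    """Record (level, phase) after each odd step, stripping even steps."""
    steps = []
    while n != 1 and len(steps) < limit:
        n = 3 * n + 1                 # odd step
        while n % 2 == 0:             # trailing even steps (halvings)
            n //= 2
        steps.append(level_phase(n))
    return steps
```

Crossing a power of 2 shows up here as the level dropping while the phase wraps past 1.0.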

    As noted, the phase effectively exists on two points on the circle at
    any moment:
    phaseA = phase
    phaseB = phase + 0.5

    Or, if one wants to see it as a circle in radians, could be like:
    angle=(phase-1.0)*(2*PI);
    Or, in degrees:
    angle=(phase-1.0)*(360);
    phaseA and phaseB are always at 180 degrees relative to each other.

    And, at each odd step, these two points move very slightly on the
    circle. The angle of each point, measured from its starting point,
    typically remains small (relative to the starting value).

    Overall alignment of the two points can be almost anywhere in the
    circle, except that it can't cross the point at 0 degrees (or
    powers-of-2) because these drop straight to 1.

    Or, in polar coordinate terms:
    rho = level
    phi = angle


    My previous attempt at a graph is wonky, though, in that the numbers
    follow a log spacing around the circle rather than a linear spacing.
    May need to fix this, since the explanation of the two points being
    180 degrees out of phase looks wonky if the diagram doesn't show
    these points as 180 degrees apart.



    Failing to increase leads to loops (leading to the same spot).
    Trying to go backwards leads to loops and instabilities (divergent
    paths, or, "more backwards" leads into negative-land).
    Trying to go forwards more than the minimum amount also leads to
    instabilities (it results in it stepping over downwards paths).


    I initially thought these instabilities would head towards infinity, but
    it seems more like they just head to a larger orbit and get stuck
    bouncing around the circle. I would need to do more looking into it to
    figure out whether they get stuck in a loop, or continue to climb (but
    more gradually). My mental model implies that these paths should
    continue to climb, but that increasing the constant bias possibly causes
    them to ascend faster.

    If they are more prone to forming actual closed loops (rather than
    climbing indefinitely), this would be inconsistent with my mental
    model, but it seems I would need to do more testing here.

    Though, loops can form if the ascent is slow enough that it manages
    to complete a full rotation of the circle without escaping to a
    higher orbit (if it manages to escape the original orbit without
    forming a loop, it should trend upwards to infinity).


    Once the relative phase (from the starting point) exceeds 180 degrees
    (or the two points covering a full circle), if the path has not ascended
    to a higher level, a closed loop will form.

    But, as noted, with the +1 step, the above scenario (for loop forming)
    becomes impossible.




    So, in this case, the "3*n+1" step would appear to be uniquely stable in
    this particular configuration.


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Feb 26 01:27:54 2026
    From Newsgroup: comp.arch


    David Brown <david.brown@hesbynett.no> posted:

    On 22/02/2026 19:38, BGB wrote:

    I was initially seeing it more like a very large hash table.
    But, I have since realized it is not a hash table, it is a more
    organized pattern.

    As far as we know, the result of applying the Collatz function
    (calculating the number of steps to reach 1, if it ever does) follows
    a pattern statistically, but not any clear pattern for individual
    values. There are a lot of mathematical functions like that - a good
    example being the number of prime factors in a number. You can draw
    graphs and visually see nice patterns - some of which can be proven,
    others are still unproven and may not apply over a big enough scale.
    Sometimes there are tricks or hints to ways to improve your chances
    when guessing which values might be outliers.

    Given by definition: even: n = n/2
    odd: n = 3*n+1 {which is even}

    The most one can get out of an iteration is n*1.5 {for large n} while
    the most one can lose from an iteration is n/2

    So, there will be a vanishingly small number of n which see the odd
    iteration enough times to outweigh the even iterations. Thus one
    would expect the conjecture is true.
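A quick empirical look at this heuristic (my own sketch, not Mitch's): since an odd step gains at most a factor of about 3 before the mandatory halving, and each even step loses a factor of 2, any n that reaches 1 must see more than log2(3) ~ 1.585 even steps per odd step.

```python
# Sketch (assumption: step_counts is an illustrative name, not from the post).
def step_counts(n):
    """Count odd (3n+1) and even (n/2) steps until n reaches 1."""
    odd = even = 0
    while n != 1:
        if n % 2:
            n = 3 * n + 1
            odd += 1
        else:
            n //= 2
            even += 1
    return odd, even

odd, even = step_counts(27)           # a famously long trajectory
# even must exceed odd * log2(3) for the trajectory to have shrunk to 1
```

For n = 27 this gives 41 odd and 70 even steps, and 70 > 41 * 1.585, matching the hand-waving argument.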
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Thu Feb 26 09:21:11 2026
    From Newsgroup: comp.arch

    On 26/02/2026 02:27, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 22/02/2026 19:38, BGB wrote:

    I was initially seeing it more like a very large hash table.
    But, I have since realized it is not a hash table, it is a more
    organized pattern.

    As far as we know, the result of applying the Collatz function (calculating
    the number of steps to reach 1, if it ever does) follows a pattern
    statistically but not any clear pattern for individual values. There
    are a lot of mathematical functions like that - a good example being the
    number of prime factors in a number. You can draw graphs and visually
    see nice patterns - some of which can be proven, others are still
    unproven and may not apply over a big enough scale. Sometimes there are
    tricks or hints to ways to improve your chances when guessing which
    values might be outliers.

    Given by definition: even: n = n/2
    odd: n = 3*n+1 {which is even}

    The most one can get out of an iteration is n*1.5 {for large n} while
    the most one can lose from an iteration is n/2

    So, there will be a vanishingly small number of n which see the odd
    iteration enough times to outweigh the even iterations. Thus one
    would expect the conjecture is true.

    Hand-waving arguments like that are far from mathematical proofs. But
    certainly that is the logic that suggests it makes more sense to work
    towards proving the conjecture, rather than disproving it - most
    mathematicians familiar with the work done on it expect that it is
    true. Working towards disproofs - trying to find counter-examples,
    or a proof of the existence of counter-examples - can still be
    helpful. Figuring out why the attempt to disprove it failed can lead
    to insights. (Simple pass/fail trial and error in itself is rarely
    helpful in this manner, but some ideas could come from looking at
    which numbers lead to unusually large growth before dropping to 1.)

    Some of the interesting proofs in this area are formalisations of
    what you said here - demonstrating that /if/ there are exceptions to
    the conjecture, then they are "vanishingly rare". That is, bounds
    can be found that limit the maximum number of counter-examples in
    any given interval.

    Still, there is no proof that the conjecture is true. It may be one of
    these statements that is true but unprovable - though of course we won't
    be able to prove that it is unprovable.

    And it may turn out not to be true, but counter-examples are very
    large. This is not unknown in number theory conjectures. Mertens'
    conjecture (which you can look up on Wikipedia - don't ask me to
    explain it, because it is beyond me) is a disproven conjecture about
    natural numbers where it is known that the first counter-example is
    between 10 ^ 16 and

    Certainly we can be confident that no one here is likely to find a
    counter-example by testing numbers beyond 2 ^ 71. But that does not
    mean people can't have fun trying to do so, or thinking up new and
    efficient ways to run the tests!

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Feb 26 19:31:20 2026
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-02-15 7:14 a.m., David Brown wrote:
    On 15/02/2026 09:58, Robert Finch wrote:

    I wonder how difficult it would be to implement a parallel tester in
    an FPGA. It looks simple enough and there are hundreds of DSP blocks
    available to use. Could test in blocks of hundreds of numbers at a
    time. Running at 100 MHz * 200 tests at a time = how long to get to 2^71?


    One of the nice things about this is that you don't need DSP blocks - it

    Yeah, I figured that out after coding. 3n+1 = n+n+n+1

    3n+1 = (n<<1)+n+1

    one add with carry in and n gets wired in directly and also shifted
    up by 1 (no logic shifter).

    Even better:

    (3n+1)/2 = n + (n>>1) + 1

    I.e two iterations for the cost of one.

    For b = n + (n>>1) + 1, for odd n, a bit of a micro-optimization is
    possible because the lowest-order bit of the result equals bit
    number 1 of n.
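A quick sanity check of the two identities in this subthread (my own sketch): for odd n, one adder computing n + (n>>1) + 1 performs two Collatz iterations at once, and the low bit of the result is just bit 1 of n, so the next odd/even decision is available without waiting for the add.

```python
# Verify, for a range of odd n:
#   n + (n>>1) + 1 == (3n+1)/2        (two iterations for one add)
#   bit 0 of result == bit 1 of n     (Thomas Koenig's micro-optimization)
for n in range(1, 100000, 2):         # odd n only
    b = n + (n >> 1) + 1
    assert b == (3 * n + 1) // 2
    assert (b & 1) == ((n >> 1) & 1)  # low result bit = bit 1 of n
```

The second identity holds because, for odd n, b = (3n+1)/2 = n + (n-1)/2 + 1, and the carry into bit 0 of that sum is fixed by the known low bits of n.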
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21b-Linux NewsLink 1.2