• Re: bits and bytes, Keeping other stuff with addresses

    From John Levine@21:1/5 to monnier@iro.umontreal.ca on Wed Dec 4 03:39:07 2024
    It appears that Stefan Monnier <monnier@iro.umontreal.ca> said:
    Yes, but that was a misunderstanding. I'm not suggesting that
    load/store instructions can access things at any bit position and any
    bit size. Any load or store with a pointer whose last 3 bits are not 0 would
    presumably signal an error.
    Seems like an odd place to put what are in practice just flag bits.

    It's a very natural one, tho.

    I don't see why. We agree that nobody expects bit addressing to work, so
    in fact those are flag bits. You can use them to point at bits if you want
    but there's no architectural or practical reason to do so.

    Byte addressing is somewhat arbitrary
    (why 8 bits, why not 16 or 4 or 6 or 9 ...?), whereas bit-addressing has
    some logic to it (fractional bit addressing would be hard to define).

    On the PDP-10 you could have any byte size you wanted, but in fact it was always
    7 bits since that was how many ASCII needed and five 7-bit characters fit in a 36 bit word with one bit left over that was used as a flag that the word contained a five digit line number.
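
    The arithmetic behind that layout (5 × 7 = 35 bits, plus one spare bit) can be sketched in C. This is only an illustration: the 36-bit word is modeled in the low bits of a `uint64_t`, the function names are hypothetical, and the placement of the spare bit at the low-order end is an assumption about the layout being described.

```c
#include <assert.h>
#include <stdint.h>

/* A 36-bit word modeled in the low 36 bits of a uint64_t.
   Assumed layout: five 7-bit characters left-justified, with the
   remaining low-order bit used as the flag. */
uint64_t pack5x7(const char text[5], int flag)
{
    uint64_t word = 0;
    for (int i = 0; i < 5; i++) {
        assert((text[i] & ~0x7F) == 0);          /* 7-bit ASCII only */
        word = (word << 7) | (uint64_t)(text[i] & 0x7F);
    }
    return (word << 1) | (uint64_t)(flag & 1);   /* 35 + 1 = 36 bits */
}

/* Extract character i (0 = leftmost) from a packed word. */
char unpack7(uint64_t word, int i)
{
    assert(i >= 0 && i < 5);
    return (char)((word >> (1 + 7 * (4 - i))) & 0x7F);
}
```

    Packing five characters and a flag always yields a value below 2^36, which is the whole point of the 7-bit choice on a 36-bit machine.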

    At this point the reason you use 8 bit bytes is that everyone else uses 8 bit bytes. In about 1980 BBN had a 16 bit byte addressed machine called the C30 that
    ran Unix and someone had the bright idea to expand the address space from 16 to 20 bits on the larger C70 by making the bytes 10 bits rather than 8. I talked to
    someone who programmed it, who told me that it was miserable because the C code was full of implicit assumptions that bytes were 8 bits.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Wed Dec 4 10:36:41 2024
    Seems like an odd place to put what are in practice just flag bits.
    It's a very natural one, tho.
    I don't see why.

    The next sentence after the one you cited explained my reasoning.

    We agree that nobody expects bit addressing to work, so in fact those
    are flag bits.

    In practice, yes.

    You can use them to point at bits if you want but there's no
    architectural or practical reason to do so.

    Indeed, which is why we could use them for tag bits.
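
    A minimal sketch of that idea, assuming a hypothetical machine whose pointers are bit addresses held in a 64-bit integer. The names `make_tagged`, `tag_of` and `byte_addr_of` are made up for illustration, and the error that a load or store would signal is modeled as a status flag rather than a trap.

```c
#include <assert.h>
#include <stdint.h>

#define TAG_MASK ((uint64_t)7)   /* low 3 bits of a bit address */

/* Scale a byte address to a bit address and put a 3-bit tag
   in the otherwise unused low bits. */
uint64_t make_tagged(uint64_t byte_addr, unsigned tag)
{
    assert(tag <= TAG_MASK);
    assert(byte_addr <= (UINT64_MAX >> 3));
    return (byte_addr << 3) | tag;
}

unsigned tag_of(uint64_t p)
{
    return (unsigned)(p & TAG_MASK);
}

/* What a load or store would see: *ok is 0 when the low bits are
   nonzero, modeling the "signal an error" case. */
uint64_t byte_addr_of(uint64_t p, int *ok)
{
    *ok = (p & TAG_MASK) == 0;
    return p >> 3;
}
```

    Software would clear the tag (`p & ~TAG_MASK`) before handing the pointer to a memory access.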


    Stefan

  • From John Levine@21:1/5 to All on Thu Dec 5 00:39:22 2024
    According to Scott Lurndal <slp53@pacbell.net>:
    8-bits won because it was enough (at the time of inception {IBM
    360--1963})

    8-bits is also a convenient multiple of four bits, which was common
    in many machines prior to the 360. The hardware in Burroughs
    BCD machines could automatically add/remove the zone digit (bits <7:4>)
    during data movement.

    Good point. S/360 packed decimal puts two digits in each byte with the low "digit" of the low order byte being the sign. If they had six bit bytes, the only other size they seriously considered, they'd either waste 1/3 of the
    bits with a digit per byte, or need what would then have been an impossibly complex encoding.
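
    The packed layout described here can be sketched as follows. The function name and interface are hypothetical; the sign codes 0xC (plus) and 0xD (minus) are the commonly used ones. With a 6-bit byte holding one 4-bit digit, 2 of every 6 bits would go unused, which is the 1/3 waste mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Encode value into len bytes of packed decimal: two BCD digits per
   byte, with the low nibble of the last (low-order) byte holding the
   sign. Returns 0 on success, -1 if the value needs more than
   2*len - 1 digits. */
int pack_decimal(long long value, uint8_t *out, size_t len)
{
    assert(out != NULL && len > 0);
    unsigned long long mag = value < 0 ? 0ULL - (unsigned long long)value
                                       : (unsigned long long)value;
    memset(out, 0, len);
    /* Last byte: lowest digit in the high nibble, sign in the low nibble. */
    out[len - 1] = (uint8_t)(((mag % 10) << 4) | (value < 0 ? 0xD : 0xC));
    mag /= 10;
    /* Remaining bytes, right to left: two digits each. */
    for (size_t i = len - 1; i-- > 0; ) {
        out[i] = (uint8_t)((mag % 10) | (((mag / 10) % 10) << 4));
        mag /= 100;
    }
    return mag == 0 ? 0 : -1;
}
```

    For example, -1234 in three bytes becomes 01 23 4D in hexadecimal.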

    Several decades later, decimal floating point used such a complex encoding.




    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Jonathan Thornburg@21:1/5 to mitchalsup@aol.com on Sat Dec 21 23:22:35 2024
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    FORTRAN COMMON blocks require misaligned accesses to double precision
    data.
    R E Q U I R E in that it is neither optional nor wise to emulate with
    exceptions. It is just barely tolerable using LD/ST Left/Right instructions
    out of the compiler.

    I, personally, went through enough PAIN with misalignment, that over
    time my mood swung from "aligned only" to "completely misaligned"::
    a) because there is no performant* SW workaround
    b) it is SO easy to fix in HW.
    c) once fixed in HW, any SW burden is so small as to be barely
    ..measurable.

    I'm not so sure (b) is true. Some cases are moderately easy to handle
    in hardware (e.g., misaligned loads that stay within a single L1 D-cache
    line), but some cases are harder (e.g., misaligned writes that cross L1
    D-cache line boundaries) and might need a microcode trap (awkward if the
    design wasn't otherwise using microcode). And some cases are even harder (e.g., misaligned writes crossing L1 D-cache line boundaries where the
    two lines are owned by different CPUs in a cache-coherent multiprocessor)
    and might need a millicode trap. And some cases may require going all the
    way up to the OS (e.g., misaligned writes that cross virtual-memory-page boundaries where one page is ok but the other is non-resident).

    So, allowing this in the architecture has several costs:
    * extra hardware implementation effort to make sure the "hardware" cases
    don't cost an extra gate delay or two on some critical path
    * extra complexity and debugging time in hardware and in system software
    (think about writing and *debugging* and *verifying* microcode/millicode
    trap handlers for all those messy write-crossing-cache/page-boundary
    cases, especially their interactions with multiprocessor cache coherency)
    * this extra effort means a longer design time and/or greater design cost,
    and hence (so long as the state-of-the-art of competing systems is still
    steadily improving with time) that means a net lower price/performance
    relative to competing systems

    And, because of the traps and their overheads (which will likely differ significantly across different implementations of the same architecture,
    e.g., different multiprocessor cache-coherency protocols), any code that actually *uses* unaligned accesses -- especially unaligned writes -- isn't performance-portable unless the actual dynamic frequency of unaligned operations is very low.

    So yes, allowing unaligned access does help "dusty deck" Fortran code...
    but it comes at a significant cost.

    --
    -- "Jonathan Thornburg [remove -color to reply]" <jt.bhbkis@gmail-pink.com>
    on the west coast of Canada
    "the stock market can remain irrational a lot longer than you can
    remain solvent" or (probably the correct original wording) "markets
    can remain irrational a lot longer than you and I can remain solvent"
    -- A. Gary Shilling (often misattributed to John Maynard Keynes)

  • From MitchAlsup1@21:1/5 to Jonathan Thornburg on Sun Dec 22 01:27:27 2024
    On Sat, 21 Dec 2024 23:22:35 +0000, Jonathan Thornburg wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    FORTRAN COMMON blocks require misaligned accesses to double precision
    data.
    R E Q U I R E in that it is neither optional nor wise to emulate with
    exceptions. It is just barely tolerable using LD/ST Left/Right instructions
    out of the compiler.

    I, personally, went through enough PAIN with misalignment, that over
    time my mood swung from "aligned only" to "completely misaligned"::
    a) because there is no performant* SW workaround
    b) it is SO easy to fix in HW.
    c) once fixed in HW, any SW burden is so small as to be barely
    ..measurable.

    I'm not so sure (b) is true. Some cases are moderately easy to handle
    in hardware (e.g., misaligned loads that stay within a single L1 D-cache line), but some cases are harder (e.g., misaligned writes that cross L1 D-cache line boundaries) and might need a microcode trap (awkward if the design wasn't otherwise using microcode). And some cases are even
    harder

    While there is no concept of Millicode or Microcode::
    There are several sequencing components::

    a) determining if the access is misaligned:: This takes 8 gates and 2 gates
       of delay from an adder already comprising 2000 gates. The misaligned
       assertion comes 5-6 gates BEFORE the higher order 32-bits come out of
       the adder:: I consider this part ignorable.
    b) accessing the cache optimally in the presence of misaligned accesses.
       b.1) if the access does not cross a cache port boundary, then all the
            problems are confined to the alignment of the data.
       b.2) if the access crosses a port boundary but not a line boundary,
            access 2 successive ports, and allow Aligner to sort out the problem.
       b.3) if the access crosses a line boundary but not a page boundary,
            access 2 successive ports incrementing the line address of the
            second port.
       b.4) if the access crosses a page boundary, you are going to have to
            access the cache twice, once for the first page, once for the second.

    So, only page crossing REQUIRES 2 accesses; and 99% (Made up number)
    are performed in a single cycle. {{Try that with some kind of SW
    workaround}}
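
    The case analysis above can be modeled in a few lines of C. This is a software sketch of the decision only, not of the hardware sequencer; the port, line and page sizes are assumed values chosen for illustration, and the names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed sizes in bytes, for illustration only. */
enum { PORT_BYTES = 16, LINE_BYTES = 64, PAGE_BYTES = 4096 };

typedef enum {
    WITHIN_PORT,    /* b.1: only the aligner is involved */
    CROSSES_PORT,   /* b.2: two successive ports, same line */
    CROSSES_LINE,   /* b.3: two ports, second with the next line address */
    CROSSES_PAGE    /* b.4: two cache accesses, one per page */
} access_case;

/* Step a): for a power-of-two size, the access is misaligned
   if any of the low address bits are set. */
int is_misaligned(uint64_t addr, unsigned size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    return (addr & (size - 1)) != 0;
}

/* Step b): compare the first and last byte of the access against
   each boundary, widest boundary first. */
access_case classify(uint64_t addr, unsigned size)
{
    assert(size != 0 && size <= PORT_BYTES);
    assert(addr <= UINT64_MAX - size);
    uint64_t last = addr + size - 1;
    if (addr / PAGE_BYTES != last / PAGE_BYTES) return CROSSES_PAGE;
    if (addr / LINE_BYTES != last / LINE_BYTES) return CROSSES_LINE;
    if (addr / PORT_BYTES != last / PORT_BYTES) return CROSSES_PORT;
    return WITHIN_PORT;
}
```

    With these assumed sizes, an 8-byte access at 0x1001 is misaligned but stays within one port, while the same access at 0xFFC straddles two pages.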

    {{Oh, and BTW; this is a good place to check that the access rights
    to both pages are compatible with the rights in both PTEs.}}

    So,
    AGEN adder is 8 gates bigger out of 2,000 total gates
    Cache port control logic is 2× as big out of 90 gates
    Cache staging flip-flops in stage ALIGN is 2× as big
    LD Aligner is bigger ~1.75×
    Tag, TLB, DATA RAMs are exactly the same size, and ~9× larger
    ....than the rest of the cache pipeline logic area
    {{And you add 25-odd gates in the Miss Buffers}}

    x86 has been doing this for 3 decades. It is well worn logic at this
    point.
    It was at AMD where I saw how easy this was for HW to simply "make the
    problem disappear" compared to all the ways SW uses to work around "not
    being able to access misaligned data". Once you have done it once, you
    have the logic and test vectors to ensure you don't shoot yourself in
    the foot.
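
    For comparison, one portable form of the software workaround looks roughly like this. It is a sketch with a hypothetical function name, assuming a little-endian value: the 64-bit result is assembled from single bytes, so it works at any address but costs eight loads plus shifts and ORs where hardware support needs one load.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Load a little-endian 64-bit value from any address using only
   byte accesses, which are always aligned. */
uint64_t load64_le_bytewise(const unsigned char *p)
{
    assert(p != NULL);
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | p[i];   /* p[7] ends up most significant */
    return v;
}
```

    A compiler targeting a machine without misaligned loads has to emit something of this shape, or a left/right instruction pair, for every access it cannot prove aligned.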

    Any competent programmer will ALIGN his data to the extent possible;
    there is no reason to penalize {Compiler, assembler, linker, ld.so,...}
    just because you want to take 5 days out of design.

    So, My design:
    Aligned data is always best, Misaligned data comes at very low cost.
    SW overhead = 0

    Your design:
    Aligned data works just fine, Misaligned data is a complete nightmare
    throughout the entire SW stack, and causes large uncertainty in result
    delivery time. SW overhead = significant.

    How many days of SW development are required to make up for the 5 days
    of HW design to simply eradicate the problem?

    You would not buy a car without anti-lock brakes--even though you will
    only use the feature once or twice in your ownership of the vehicle !?!
    Why would you buy a CPU that is not similar?

    (e.g., misaligned writes crossing L1 D-cache line boundaries where the
    two lines are owned by different CPUs in a cache-coherent multiprocessor)
    and might need a millicode trap. And some cases may require going all the
    way up to the OS (e.g., misaligned writes that cross virtual-memory-page
    boundaries where one page is ok but the other is non-resident).

    Millicode is so DEC ALPHA. Fixing the problem in HW does not require
    anything but the 5 sequences I illustrated above--this number of
    sequencers is invisible in the cache pipeline as a whole.

    So, allowing this in the architecture has several costs:
    * extra hardware implementation effort to make sure the "hardware" cases
    don't cost an extra gate delay or two on some critical path

    AMD had done all of this by 1997. {don't know about when Intel had it
    licked}

    But, yes, if you have a balls-to-the-wall pipeline (R2000) adding a gate of
    delay would degrade performance by ~5%. This has only been shown to be an
    issue when the cache pipeline is 2 stages and one is trying to get::

    Forward->AGEN->RAMS->ALIGN->resultbus in 2 cycles.

    MIPS had to use direct mapped caches to meet this timing, and had to
    sample SRAM chips on its own test head to measure if the SRAMs had
    pin timings appropriate to R2000 timings.

    Once you have set-associativity or allow for 3 cycles {note current
    Intel cores are using 5 cycles.} your argument fails.

    While your argument might succeed in 2µ-through-90nm, wires have become
    so slow that in many cases adding a gate of pure delay does not slow
    anything because the cache pipeline has been engineered O F F the/any
    critical path. So, while RISC-V persists with the 2 cycle cache pipeline,
    the big boys have migrated to longer pipelines and build execution
    windows to absorb the added latencies.

    * extra complexity and debugging time in hardware and in system software
    (think about writing and *debugging* and *verifying* microcode/millicode
    trap handlers for all those messy write-crossing-cache/page-boundary
    cases, especially their interactions with multiprocessor cache coherency)

    There is N O M I L L I C O D E. There is a sequencer that can take 1 of
    5 paths over the AGEN-CACHE-ALIGN stages of the pipeline. SW has to do
    nothing to enable this, or overcome poor/bad use of ISA.

    * this extra effort means a longer design time and/or greater design cost,
    and hence (so long as the state-of-the-art of competing systems is still
    steadily improving with time) that means a net lower price/performance
    relative to competing systems

    So does IEEE 754 floating point !! It is significantly more logic intensive
    than IBM or CRAY or Univac floating points. Yet, currently, some larger
    cores contain 4-8 of these floating point units (8-16 if you separate
    FMUL/FDIV from FADD/FSUB/FCMP).

    And, because of the traps

    There are N O T R A P S, no exceptions (other than expected), no
    interrupt dependencies, no mispredict repair dependencies, no
    coherence dependencies, ...

    and their overheads (which will likely differ significantly across
    different implementations of the same architecture, e.g., different
    multiprocessor cache-coherency protocols), any code that actually *uses*
    unaligned accesses -- especially unaligned writes -- isn't
    performance-portable unless the actual dynamic frequency of unaligned
    operations is very low.

    UnTrue.

    So yes, allowing unaligned access does help "dusty deck" Fortran code...
    but it comes at a significant cost.

    Less than 0.1% is a significant cost ?!?!?

  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Sun Dec 22 10:01:48 2024
    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    On Sat, 21 Dec 2024 23:22:35 +0000, Jonathan Thornburg wrote:

    Any competent programmer will ALIGN his data to the extent possible;
    there is no reason to penalize {Compiler, assembler, linker, ld.so,...}
    just because you want to take 5 days out of design.

    These days, the competence of many programmers can be called into
    question :-)

    ABIs, however, generally require natural alignment for types, so
    the point is somewhat moot, at least where user code is concerned.
    Consider

    typedef struct
    {
        unsigned char a;
        unsigned long b;
    } mytype;

    unsigned long add (mytype *x)
    {
        return x->a + x->b;
    }

    which gets translated into

        ldub    r2,[r1]
        ldd     r1,[r1,8]
        add     r1,r1,r2
        ret

    so the cost for the tool chain is already spent (or is spent
    again and again if people use structs like the above). I think
    the VAX was the last major architecture which specified unaligned
    struct access.
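
    The padding that the displacement of 8 in the generated code implies can be checked directly. A small sketch, using the same struct; the helper name is hypothetical, and the exact number of padding bytes depends on the ABI (7 where `unsigned long` is 8 bytes with 8-byte alignment).

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    unsigned char a;
    unsigned long b;
} mytype;

/* Number of padding bytes the compiler inserted between a and b. */
size_t padding_before_b(void)
{
    return offsetof(mytype, b) - sizeof(unsigned char);
}

/* Natural alignment: b's offset is a multiple of its alignment. */
static_assert(offsetof(mytype, b) % _Alignof(unsigned long) == 0,
              "b is naturally aligned");
```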

    On Sat, 21 Dec 2024 23:22:35 +0000, Jonathan Thornburg wrote:

    So yes, allowing unaligned access does help "dusty deck" Fortran code...
    but it comes at a significant cost.

    Fortran compilers, even on machines which allow misalignment, use
    ABIs which specify alignment for COMMON blocks, in violation of
    the Fortran standard. They usually have a flag for when the
    user actually needs to have no padding.

    Code which would not work with padding would have to be dusty
    indeed (fossilized?) if it used COMMON blocks that way. It would
    never have run on early RISCs, so it would likely have had a time
    of non-use in the 1990s when RISCs ruled in price/performance
    after mainframes and the VAX fell behind.

  • From Anton Ertl@21:1/5 to Jonathan Thornburg on Sun Dec 22 10:33:01 2024
    Jonathan Thornburg <jonathan@gold.bkis-orchard.net> writes:
    And some cases are even harder
    (e.g., misaligned writes crossing L1 D-cache line boundaries where the
    two lines are owned by different CPUs in a cache-coherent multiprocessor)
    and might need a millicode trap.

    Made me look up "millicode". Anything at all might need a millicode
    trap on implementations that use millicode, but I don't see any
    particular issue here that would make millicode particularly relevant.

    And some cases may require going all the
    way up to the OS (e.g., misaligned writes that cross virtual-memory-page
    boundaries where one page is ok but the other is non-resident).

    Again, sure, if you access a page that is not present, the hardware
    traps to the OS to make that page present, but that's also the case
    without unaligned accesses; and with software emulation of unaligned
    accesses, as on Alpha with, e.g., UAC_NOPRINT, every unaligned access
    traps to the OS. Is this better?

    And, because of the traps and their overheads (which will likely differ
    significantly across different implementations of the same architecture,
    e.g., different multiprocessor cache-coherency protocols), any code that
    actually *uses* unaligned accesses -- especially unaligned writes -- isn't
    performance-portable unless the actual dynamic frequency of unaligned
    operations is very low.

    Possible, but hardly relevant. E.g., I am interested in such things
    and I was completely unaware of the penalties of unaligned stores
    until I measured them: <http://al.howardknight.net/?ID=143135464800> <https://www.complang.tuwien.ac.at/anton/unaligned-stores/>. I expect
    that even among performance-conscious programmers, only a small
    minority knows more than to avoid them, when it's cheaply possible.

    Maybe some (probably more than are aware of actual costs) think that
    they should avoid them at all cost, and then use, e.g., bytewise
    approaches for hashing strings rather than the on-average faster approaches
    that fetch string data as wide as is practical and hash that.
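
    A rough sketch of the wide-fetch approach. The mixing function is a toy made up for illustration, not a published hash and not the code that was measured; `memcpy` is used as the portable way to express a possibly unaligned 8-byte load, which compilers typically turn into a single load on hardware that supports unaligned access.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy multiplicative hash that consumes 8 bytes per step,
   regardless of how the string happens to be aligned. */
uint64_t hash_wide(const char *s, size_t len)
{
    assert(s != NULL || len == 0);
    uint64_t h = 0x9E3779B97F4A7C15ULL;   /* arbitrary odd seed */
    uint64_t total = len;
    while (len >= 8) {
        uint64_t chunk;
        memcpy(&chunk, s, 8);             /* possibly unaligned load */
        h = (h ^ chunk) * 0x100000001B3ULL;
        s += 8;
        len -= 8;
    }
    uint64_t tail = 0;
    if (len > 0)
        memcpy(&tail, s, len);            /* remaining 1..7 bytes */
    h = (h ^ tail) * 0x100000001B3ULL;
    return h ^ total;                     /* fold in the length */
}
```

    The result depends only on the bytes and the length, so the same string hashes identically whether it starts on an 8-byte boundary or not.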

    So yes, allowing unaligned access does help "dusty deck" Fortran code...
    but it comes at a significant cost.

    It's not just dusty deck Fortran code.

    And the cost for not supporting unaligned accesses is higher. There's
    a reason why all surviving general-purpose architectures support
    unaligned accesses.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From John Dallman@21:1/5 to Koenig on Sun Dec 22 11:42:00 2024
    In article <vk8o2c$in5m$1@dont-email.me>, tkoenig@netcologne.de (Thomas
    Koenig) wrote:

    These days, the competence of many programmers can be called into
    question :-)

    ABIs, however, generally require natural alignment for types, so
    the point is somewhat moot, at least where user code is concerned.
    Consider

    Decades ago, there seems to have been a fashion among Windows application programmers for using somewhat tighter packing. For reasons that I still
    don't really understand, Microsoft's x86 C/C++ compiler provides an
    option for setting the largest alignment allowed. If you set it to two
    bytes, shorts still get natural alignment, but everything larger is
    aligned to two-byte boundaries.
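
    The effect can be reproduced per struct with `#pragma pack`, which MSVC, GCC and Clang all accept; the compiler-wide switch referred to is presumably `/Zp2`. A sketch with hypothetical struct names, assuming a 2-byte `short` and an 8-byte `double`:

```c
#include <assert.h>
#include <stddef.h>

/* Cap member alignment at 2 bytes for this struct only. */
#pragma pack(push, 2)
typedef struct {
    char   c;   /* offset 0 */
    short  s;   /* offset 2: natural alignment is kept */
    double d;   /* offset 4: capped at 2, no longer 8-byte aligned */
} packed2;
#pragma pack(pop)

/* Same fields with the ABI's default layout, for comparison. */
typedef struct {
    char   c;
    short  s;
    double d;
} natural;

static_assert(offsetof(packed2, s) == 2, "short keeps natural alignment");
static_assert(offsetof(packed2, d) == 4, "double lands on a 2-byte boundary");
```

    Any library compiled with the default layout sees `d` at a different offset than a caller compiled with the 2-byte cap, which is why mixing the two fails.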

    After we'd explained to the third customer that this would not work with
    our libraries, it got a paragraph in the documentation, explicitly saying
    that if you used it, the libraries would crash. No attempts to do this
    have reached second-line support since then. Sometimes, a totally
    unvarnished warning is the right thing to use.

    John

  • From Anton Ertl@21:1/5 to Thomas Koenig on Sun Dec 22 11:06:26 2024
    Thomas Koenig <tkoenig@netcologne.de> writes:
    ABIs, however, generally require natural alignment for types, so
    the point is somewhat moot, at least where user code is concerned.

    Let's weaken this to "good ABIs require natural alignment for basic
    types". Intel has erred in both directions:

    * In its IA32 ABI, they required 4-byte alignment for 8-byte FP data,
    while their hardware traps (with the AC flag set) when accessing an
    8-byte FP value on a 4-byte-aligned address that is not
    8-byte-aligned. This makes the AC flag useless.

    * They made SSE/SSE2 instructions require 16-byte alignment
    irrespective of the basic data type; as a result, at least one AMD64
    ABI requires 16-byte alignment of the stack on call boundaries,
    which means that many functions have to adjust the stack pointer in
    order to reach that alignment (the CALL instruction changes the
    stack pointer by 8). That's apparently based on the theory that
    using load and op and movdqa is faster than movdqu, which is
    questionable <https://www.complang.tuwien.ac.at/anton/autovectors/>.

    I think
    the VAX was the last major architecture which specified unaligned
    struct access.

    The 4-byte alignment for 8-byte floats on IA-32 also holds for struct
    fields.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Sun Dec 22 21:04:06 2024
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    Aligned data is always best, Misaligned data comes at very low cost.
    SW overhead = 0

    Thinking about this for a bit... for a clean-sheet architecture
    like My66000, could there actually be an advantage to do
    struct layout like the VAX did, with everything aligned on byte
    boundaries? On the plus side, there would be lower memory use.
    On the minus side... very low cost, as you wrote above.

  • From MitchAlsup1@21:1/5 to Thomas Koenig on Sun Dec 22 23:32:41 2024
    On Sun, 22 Dec 2024 21:04:06 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    Aligned data is always best, Misaligned data comes at very low cost.
    SW overhead = 0

    Thinking about this for a bit... for a clean-sheet architecture
    like My66000, could there actually be an advantage to do
    struct layout like the VAX did, with everything aligned on byte
    boundaries? On the plus side, there would be lower memory use.
    On the minus side... very low cost, as you wrote above.

    The correct spelling is "My 66000"; this may come up later if
    my trademark is accepted.

  • From Stefan Monnier@21:1/5 to All on Thu Dec 26 13:38:04 2024
    Aligned data is always best, Misaligned data comes at very low cost.
    SW overhead = 0
    Thinking about this for a bit... for a clean-sheet architecture
    like My66000, could there actually be an advantage to do
    struct layout like the VAX did, with everything aligned on byte
    boundaries?

    I highly doubt it. Making unaligned accesses work efficiently is great,
    but that's no reason to abuse them:

    - Going back to Mitch's description, in case B.1 the misalignment is
    truly "free", but for B.2, B.3, and B.4 the misalignment does come at
    a cost, not necessarily visible in terms of cycles but at least in
    terms of cache bandwidth, which can have an impact on overall speed
    and energy use.
    Of course, properly aligning your data will also come with costs,
    but "packed structs" don't come totally free.
    - AFAIK most efforts to support concurrency take it for granted that
    atomic accesses are supported only when properly aligned.

    I expect it's at least as easy (and more portable) to reorder fields by
    order of (expected) size to avoid excessive padding in aligned data,
    than it is to add manual padding/alignment to avoid the cost of
    misalignment in "packed structs".
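
    As a small illustration of the reordering approach (the struct and field names are made up), sorting the fields from largest to smallest moves the padding to the end of the struct while every field stays naturally aligned. On a typical 64-bit ABI the first layout below takes 24 bytes and the second 16; other ABIs may give different numbers.

```c
#include <assert.h>
#include <stddef.h>

/* Declaration order forces padding after each char. */
typedef struct {
    char   tag;
    double value;
    char   flag;
    int    count;
} as_written;

/* Same fields, largest first: padding collects at the end. */
typedef struct {
    double value;
    int    count;
    char   tag;
    char   flag;
} reordered;

/* Bytes recovered by reordering, with no packing pragma involved. */
size_t bytes_saved(void)
{
    return sizeof(as_written) - sizeof(reordered);
}

static_assert(sizeof(reordered) <= sizeof(as_written),
              "reordering does not grow the struct");
```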


    Stefan

  • From George Neuner@21:1/5 to monnier@iro.umontreal.ca on Thu Dec 26 16:31:07 2024
    On Thu, 26 Dec 2024 13:38:04 -0500, Stefan Monnier
    <monnier@iro.umontreal.ca> wrote:

    Aligned data is always best, Misaligned data comes at very low cost.
    SW overhead = 0
    Thinking about this for a bit... for a clean-sheet architecture
    like My66000, could there actually be an advantage to do
    struct layout like the VAX did, with everything aligned on byte
    boundaries?

    I highly doubt it. Making unaligned accesses work efficiently is great,
    but that's no reason to abuse them:

    - Going back to Mitch's description, in case B.1 the misalignment is
    truly "free", but for B.2, B.3, and B.4 the misalignment does come at
    a cost, not necessarily visible in terms of cycles but at least in
    terms of cache bandwidth, which can have an impact on overall speed
    and energy use.
    Of course, properly aligning your data will also come with costs,
    but "packed structs" don't come totally free.
    - AFAIK most efforts to support concurrency take it for granted that
    atomic accesses are supported only when properly aligned.

    I expect it's at least as easy (and more portable) to reorder fields by
    order of (expected) size to avoid excessive padding in aligned data,
    than it is to add manual padding/alignment to avoid the cost of
    misalignment in "packed structs".

    It would have to be done manually in C or C++ because they don't
    permit the compiler to reorder the fields of a struct.


    Stefan

  • From MitchAlsup1@21:1/5 to Stefan Monnier on Thu Dec 26 21:59:05 2024
    On Thu, 26 Dec 2024 18:38:04 +0000, Stefan Monnier wrote:

    Aligned data is always best, Misaligned data comes at very low cost.
    SW overhead = 0
    Thinking about this for a bit... for a clean-sheet architecture
    like My66000, could there actually be an advantage to do
    struct layout like the VAX did, with everything aligned on byte
    boundaries?

    I highly doubt it. Making unaligned accesses work efficiently is great,
    but that's no reason to abuse them:

    Agreed, highest performance comes when there are as few misaligned
    memory references as possible, and when the penalty of misalignedness
    is small--but DO NOT ABUSE this freedom.

    - Going back to Mitch's description, in case B.1 the misalignment is
    truly "free", but for B.2, B.3, and B.4 the misalignment does come at
    a cost, not necessarily visible in terms of cycles but at least in
    terms of cache bandwidth, which can have an impact on overall speed
    and energy use.
    Of course, properly aligning your data will also come with costs,

    Should be only memory footprint.

    but "packed structs" don't come totally free.

    There is NO REASON to make them slower than doable.

    - AFAIK most efforts to support concurrency take it for granted that
    atomic accesses are supported only when properly aligned.

    And you don't want multiple locks in the same cache line.
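
    One way to keep locks apart, sketched with C11 atomics. The 64-byte line size is an assumption (real hardware varies), and the type and function names are hypothetical. Aligning the flag to the line size also rounds the struct size up to a full line, so neighbouring array elements never share one.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumed line size; real hardware varies */

/* One lock per cache line. */
typedef struct {
    alignas(CACHE_LINE) atomic_flag held;
} padded_lock;

void lock_acquire(padded_lock *l)
{
    /* test_and_set returns the previous value: spin while it was set. */
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;
}

void lock_release(padded_lock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

static_assert(sizeof(padded_lock) == CACHE_LINE, "one lock per line");
```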

    I expect it's at least as easy (and more portable) to reorder fields by
    order of (expected) size to avoid excessive padding in aligned data,
    than it is to add manual padding/alignment to avoid the cost of
    misalignment in "packed structs".


    Stefan
