• Re: Missing MCA error messages for bad ECC

    From G. Paul Ziemba@pz-freebsd-stable@ziemba.us to muc.lists.freebsd.stable on Mon Feb 2 08:16:37 2026
    From Newsgroup: muc.lists.freebsd.stable

    Bob,

    thanks for your suggestions.

    The motherboard is a plain X11SCA (no -F ipmi)

    I don't know of a way to read the power supply voltages in software
    while FreeBSD is running, but I did reboot into the BIOS setup and
    read voltages there, and they look normal to me:

    VCPU: 1.136
    VDIMM: 1.224
    12V: 12.233
    5VCC: 5.184
    3.3V_DL: 3.327
    3.3VCC: 3.424
    VSB: 3.328
    VBAT: 3.104
    VCC1_8_DL_PCM: 1.816
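For anyone who wants to put numbers on "look normal": a rough sketch that compares the three main rails above against the ±5% tolerance the ATX spec allows on +12V, +5V and +3.3V. Mapping the BIOS labels 12V, 5VCC and 3.3VCC onto those nominal rails is my assumption, and the function names are made up for illustration. The other readings (VCPU, VDIMM, VSB and so on) are left out because their nominal values depend on the board and CPU.

```python
# Nominal ATX rail voltages; the label-to-rail mapping is an assumption.
NOMINAL = {"12V": 12.0, "5VCC": 5.0, "3.3VCC": 3.3}

# Readings copied from the BIOS hardware monitor screen quoted above.
READINGS = {"12V": 12.233, "5VCC": 5.184, "3.3VCC": 3.424}

TOLERANCE = 0.05  # +/-5% on the positive main rails


def deviation(measured, nominal):
    """Return the signed fractional deviation from nominal."""
    return (measured - nominal) / nominal


def check_rails(readings, nominal=NOMINAL, tol=TOLERANCE):
    """Map each rail name to (deviation, within_tolerance)."""
    result = {}
    for name, volts in readings.items():
        dev = deviation(volts, nominal[name])
        result[name] = (dev, abs(dev) <= tol)
    return result


if __name__ == "__main__":
    for rail, (dev, ok) in check_rails(READINGS).items():
        print(f"{rail}: {dev * 100:+.1f}% {'ok' if ok else 'OUT OF RANGE'}")
```

All three rails come out high but inside ±5%. A static reading like this says nothing about ripple or behaviour under load.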

    The BIOS versions are given as:

    "ver 1.2 Build Date 12/5/19" near the top of the screen; and
    "version 2.19.0045 (c) [AMI]" at the bottom of the screen

    I didn't see any setting that obviously controls how
    events are filtered, but there WAS an event log that had
    completely filled up with messages of the form:

    <datetime> smbios 0x02 DIMMB1

    with many for DIMMB1 and DIMMB2. I haven't found any documentation yet
    of "0x02" other than a few online posts calling it either a single-bit
    or a multi-bit ECC memory error.

    I'm still favoring a diagnosis of two bad DIMMs; I just wish there were
    a way to cause these errors to show up in FreeBSD somewhere so I could
    detect them on a running system.


    On Sun, Feb 01, 2026 at 08:30:56PM +0000, Bob Bishop wrote:
    > Hi,
    >
    > On 1 Feb 2026, at 16:35, G. Paul Ziemba <pz-freebsd-stable@ziemba.us> wrote:
    >
    >> OS: 14.2-STABLE as of 250403
    >>
    >> I seem to have at least one bad ECC DIMM
    >
    > Check the power supply voltages are within tolerance if you haven't already.
    >
    >> and was expecting to see MCA
    >> messages in /var/log/messages or to the console (which I have recently
    >> redirected to /var/log/console.log via syslog.conf:
    >>
    >> console.info /var/log/console.log
    >>
    >> but I can't find anything in any of my logs. Why am I not seeing them?
    >
    > If you have the -F variant of the board that supports IPMI, it may be that
    > the BMC is capturing the errors so check the BMC event log. Possibly there
    > is a setting on the BMC to control what gets passed to MCA.
    >
    > Also check the BIOS event logging; I don't see settings in the BIOS to
    > control MCA events.
    >
    > And check the BIOS version is up to date.
    >
    >> Background:
    >>
    >> Motherboard: Supermicro X11SCA
    >> CPU: Xeon E-2176G
    >> Chipset: C246
    >> Memory: 4x SK Hynix HMA82GU7CJR8N-VK (16GB ECC)
    >>
    >> Bios reports ECC on its startup screen and dmidecode reports
    >>
    >> Total Width: 72 bits
    >> Data Width: 64 bits
    >>
    >> for each of the dimms.
    >>
    >> Amanda started reporting checksum errors on large backup files in its
    >> holding disk. I discovered that a large file (200GB) on any of three
    >> disks on this system yields different sha512sum values every time I
    >> run it on the same file. SMART data looks OK on all disks.
    >>
    >> memtest86+ finds three bad spots in memory, at 42G, 47G and 53G. I have
    >> 4x16GB dimms installed, so I think that corresponds to two bad dimms.
    >>
    >> % sysctl hw.mca
    >> hw.mca.cmc_throttle: 60
    >> hw.mca.force_scan: 0
    >> hw.mca.interval: 300
    >> hw.mca.maxcount: -1
    >> hw.mca.count: 0
    >> hw.mca.erratum383: 0
    >> hw.mca.intel6h_HSD131: 0
    >> hw.mca.amd10h_L1TP: 1
    >> hw.mca.log_corrected: 1
    >> hw.mca.enabled: 1
    >>
    >> Thanks for any insights.
    >> --
    >> G. Paul Ziemba
    >> FreeBSD unix:
    >> 8:31AM up 2 days, 14:38, 11 users, load averages: 0.71, 0.43, 0.39
    >
    > --
    > Bob Bishop t: +44 (0)118 940 1243
    > rb@gid.co.uk m: +44 (0)783 626 4518





    --
    G. Paul Ziemba
    FreeBSD unix:
    7:51AM up 35 mins, 2 users, load averages: 0.32, 0.56, 0.47


    --
    Posted automagically by a mail2news gateway at muc.de e.V.
    Please direct questions, flames, donations, etc. to news-admin@muc.de
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Jan Martin Mikkelsen@janm@transactionware.com to muc.lists.freebsd.stable on Mon Feb 2 21:27:30 2026
    From Newsgroup: muc.lists.freebsd.stable

    Hi,
    Possibly a silly question. Have you tried sysutils/mcelog?
    Regards,
    Jan M.
  • From Chris@bsd-lists@bsdforge.com to muc.lists.freebsd.stable on Mon Feb 2 13:23:10 2026
    From Newsgroup: muc.lists.freebsd.stable


    On 2026-02-02 08:16, G. Paul Ziemba wrote:
    > Bob,
    >
    > thanks for your suggestions.
    >
    > The motherboard is a plain X11SCA (no -F ipmi)
    >
    > I don't know of a way to read the power supply voltages in software
    > while FreeBSD is running, but I did reboot into the BIOS setup and
    > read voltages there, and they look normal to me:
    >
    > VCPU: 1.136
    > VDIMM: 1.224
    > 12V: 12.233
    > 5VCC: 5.184
    > 3.3V_DL: 3.327
    > 3.3VCC: 3.424
    > VSB: 3.328
    > VBAT: 3.104
    > VCC1_8_DL_PCM: 1.816

    I'd just like to mention here that while the voltages may read
    within the expected range, that will not tell you about AC bleed. IOW,
    failing diodes will leak AC, which will result in (eventual) component
    failure. I've tossed many a PSU for just this reason. If you happen to
    have a spare around, it'd make it pretty easy to test this.

    --Chris


  • From G. Paul Ziemba@pz-freebsd-stable@ziemba.us to muc.lists.freebsd.stable on Fri Feb 6 18:54:50 2026
    From Newsgroup: muc.lists.freebsd.stable

    For X11SCA owners who might find this thread in the future:

    Test results:

    I did some more exhaustive testing with memtest86 and individual
    DIMMs installed. These DIMMs were all SK Hynix HMA82GU7CJR8N-VK,
    purchased directly from Supermicro around 2019-2020 in one batch.

    All four DIMMs had varying degrees of failure. Two of them had in
    the neighborhood of 20 errors each in one full pass of memtest86.
    A third had about two errors in a full pass of memtest86. In all
    of those cases, a bunch of "smbios 0x02" events showed up in the
    event log visible from the BIOS setup screens.

    The fourth original DIMM had no errors in one full pass of memtest86,
    but generated a few events in the smbios log.

    I got two new no-name ECC DIMMs and tested them. No errors in one
    full pass of memtest86, and no smbios events logged.

    Future monitoring:

    I'm still dismayed that FreeBSD doesn't seem to notice/report these
    ECC events. I have not yet had a chance to exhaustively search the BIOS
    setup screens for a setting that might enable signaling the OS.

    However, I noticed that "dmidecode" reports information about the
    "System Event Log". I found the published SMBIOS specification
    (see, for example, https://www.dmtf.org/standards/smbios) that
    describes the format of the System Event Log.

    I wrote a simple perl script to open /dev/mem, seek to the start
    address, and read the log area and got back what looked like a
    valid event log. It should be straightforward to parse the log entries
    and discover ECC events, so I can build a monitoring solution for
    this motherboard.
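For future readers, here is a rough Python sketch of the approach described above. It is not the Perl script itself. It assumes the record layout I understand from the SMBIOS System Event Log section: one byte of event type, one byte of length (low 7 bits, counting the whole record), six BCD bytes of year/month/day/hour/minute/second, then variable data, with type 0xFF marking end of log. It also assumes that types 0x01 and 0x02 are the single-bit and multi-bit ECC events, and that two-digit years of 80 and above mean 19xx. Check all of that against the spec. The start address and area length have to come from the "System Event Log" section of dmidecode; none are hard-coded here. All function names are made up for illustration.

```python
import os
from datetime import datetime

END_OF_LOG = 0xFF
# Event type names as I read them in the SMBIOS spec (assumption: verify).
EVENT_NAMES = {
    0x01: "single-bit ECC memory error",
    0x02: "multi-bit ECC memory error",
}


def read_region(path, start, length):
    """Read `length` bytes at byte offset `start` (e.g. from /dev/mem, as root)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, length, start)
    finally:
        os.close(fd)


def bcd(byte):
    """Decode one packed-BCD byte, e.g. 0x26 -> 26."""
    return (byte >> 4) * 10 + (byte & 0x0F)


def parse_records(data):
    """Walk the log data area and return a list of record dicts."""
    records = []
    pos = 0
    while pos + 8 <= len(data) and data[pos] != END_OF_LOG:
        event_type = data[pos]
        length = data[pos + 1] & 0x7F  # top bit is a "record read" flag
        if length < 8 or pos + length > len(data):
            break  # corrupt or truncated record, stop rather than guess
        yy, mo, dd, hh, mi, ss = (bcd(b) for b in data[pos + 2:pos + 8])
        year = 1900 + yy if yy >= 80 else 2000 + yy
        try:
            when = datetime(year, mo, dd, hh, mi, ss)
        except ValueError:
            when = None  # timestamp bytes were not a valid date
        records.append({
            "type": event_type,
            "name": EVENT_NAMES.get(event_type, "other"),
            "when": when,
            "payload": bytes(data[pos + 8:pos + length]),
        })
        pos += length
    return records


def ecc_events(records):
    """Keep only the records whose type is one of the two ECC event types."""
    return [r for r in records if r["type"] in EVENT_NAMES]
```

A monitoring job could call read_region() on /dev/mem with the address and length dmidecode reports, skip the log header to reach the data start offset, run parse_records(), and alert when ecc_events() returns more entries than on the previous run.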

    --
    G. Paul Ziemba
    FreeBSD unix:
    10:51AM up 1 day, 13:34, 17 users, load averages: 0.35, 0.25, 0.26

