• Re: Ubuntu 22 Boot Errors

    From Andy Burns@usenet@andyburns.uk to alt.os.linux,uk.comp.os.linux on Sun Dec 8 12:00:34 2024
    From Newsgroup: uk.comp.os.linux

    Java Jive wrote:

    ERROR: Unable to locate IOAPIC for GSI 37

    Possible dodgy config tables in BIOS/UEFI.

    firmware upgrade?
    or legacy options you can disable?
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From vallor@vallor@cultnix.org to alt.os.linux,uk.comp.os.linux on Mon Dec 9 03:36:32 2024
    From Newsgroup: uk.comp.os.linux

    On Sun, 8 Dec 2024 11:41:05 +0000, Java Jive <java@evij.com.invalid> wrote
    in <vj40kk$3p436$1@dont-email.me>:

    Dec 06 21:18:47 HOSTNAME kernel: *BAD*gran_size: 128M chunk_size: 2G
    num_reg: 10 lose cover RAM: -834M

    I don't know why this would happen, but if it happened to me,
    I'd run "lsmem" and "free" and make sure all the memory eventually
    made it online...

    Also, it looks like it might have something to do with mtrr. On
    my host, dmesg reads:

    [ 0.000000] total RAM covered: 3071M
    [ 0.000000] Found optimal setting for mtrr clean up
    [ 0.000000] gran_size: 64K chunk_size: 128M num_reg: 3
    lose cover RAM: 0G
    [ 0.000000] MTRR map: 7 entries (3 fixed + 4 variable; max 20), built
    from 9 variable MTRRs

    So these entries may be due to MTRR, which according to Documentation/arch/x86/mtrr.rst is getting phased out. On my
    system, I see this when I cat /proc/mtrr:

    $ sudo cat /proc/mtrr
    reg00: base=0x000000000 ( 0MB), size= 2048MB, count=1: write-back
    reg01: base=0x080000000 ( 2048MB), size= 1024MB, count=1: write-back
    reg02: base=0x0ba0a0000 ( 2976MB), size= 64KB, count=1: uncachable

    Give a gander to that document, it outlines what mtrr might be used for
    in modern-day kernels.

    https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/arch/x86/mtrr.rst?h=v6.12.3
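
    For anyone wanting to compare their own /proc/mtrr with the listing above, here is a rough sketch that totals the variable MTRR sizes per cache type. The sample text is the listing quoted in this post; the function name and the regular expression are my own, and it assumes sizes are only ever printed in KB or MB.

```python
import re
from collections import defaultdict

# Sample copied from the /proc/mtrr listing quoted in this post.
SAMPLE = """\
reg00: base=0x000000000 ( 0MB), size= 2048MB, count=1: write-back
reg01: base=0x080000000 ( 2048MB), size= 1024MB, count=1: write-back
reg02: base=0x0ba0a0000 ( 2976MB), size= 64KB, count=1: uncachable
"""

# Assumed line layout, tolerant of the variable padding /proc/mtrr uses.
MTRR_RE = re.compile(
    r"reg(?P<reg>\d+): base=0x(?P<base>[0-9a-fA-F]+) \(\s*\d+MB\), "
    r"size=\s*(?P<size>\d+)(?P<unit>KB|MB), count=\d+: (?P<type>\S+)"
)


def coverage_by_type(text):
    """Return {cache type: total KiB covered by MTRRs of that type}."""
    totals = defaultdict(int)
    for m in MTRR_RE.finditer(text):
        kib = int(m["size"]) * (1024 if m["unit"] == "MB" else 1)
        totals[m["type"]] += kib
    return dict(totals)


if __name__ == "__main__":
    # On a live system, replace SAMPLE with open("/proc/mtrr").read()
    # (reading it may need root).
    for cache_type, kib in coverage_by_type(SAMPLE).items():
        print(f"{cache_type}: {kib / 1024:g} MiB")
```

    For the sample this gives 3072 MiB of write-back and 64 KiB of uncachable, which looks consistent with the "total RAM covered: 3071M" line in dmesg, since the uncachable register sits inside the write-back range.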
    --
    -v System76 Thelio Mega v1.1 x86_64 NVIDIA RTX 3090 Ti
    OS: Linux 6.12.3 Release: Mint 21.3 Mem: 258G
  • From Java Jive@java@evij.com.invalid to alt.os.linux,uk.comp.os.linux on Mon Dec 9 20:33:25 2024
    From Newsgroup: uk.comp.os.linux

    On 2024-12-09 03:36, vallor wrote:

    On Sun, 8 Dec 2024 11:41:05 +0000, Java Jive <java@evij.com.invalid> wrote
    in <vj40kk$3p436$1@dont-email.me>:

    Dec 06 21:18:47 HOSTNAME kernel: *BAD*gran_size: 128M chunk_size: 2G
    num_reg: 10 lose cover RAM: -834M

    I don't know why this would happen, but if it happened to me,
    I'd run "lsmem" and "free" and make sure all the memory eventually
    made it online...

    My first reaction on seeing the messages was to boot into memcheck and
    let a full cycle complete, but no memory problems were found. I should
    have mentioned in my OP that I'd already done that, but it slipped my
    mind.

    Also, it looks like it might have something to do with mtrr.

    Yes, as I originally linked.

    On
    my host, dmesg reads:

    [ 0.000000] total RAM covered: 3071M
    [ 0.000000] Found optimal setting for mtrr clean up
    [ 0.000000] gran_size: 64K chunk_size: 128M num_reg: 3
    lose cover RAM: 0G
    [ 0.000000] MTRR map: 7 entries (3 fixed + 4 variable; max 20), built
    from 9 variable MTRRs

    So these entries may be due to MTRR, which according to Documentation/arch/x86/mtrr.rst is getting phased out. On my
    system, I see this when I cat /proc/mtrr:

    $ sudo cat /proc/mtrr
    reg00: base=0x000000000 ( 0MB), size= 2048MB, count=1: write-back
    reg01: base=0x080000000 ( 2048MB), size= 1024MB, count=1: write-back
    reg02: base=0x0ba0a0000 ( 2976MB), size= 64KB, count=1: uncachable

    Give a gander to that document, it outlines what mtrr might be used for
    in modern-day kernels.

    https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/arch/x86/mtrr.rst?h=v6.12.3

    These are both Dell Precisions, an M6700 and an M6800, so no chance of
    PAT. One of the links I gave suggests passing kernel boot parameters
    that specify a good compromise setting, one which minimises unused RAM
    and thereby skips the failed search that produced the trace I quoted.
    I wasn't very specific in my OP, but I think the best I can do is
    determine such boot parameters, and any help with that from someone
    more knowledgeable than myself would be much appreciated.
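
    As a starting point for working out such parameters, here is a rough sketch that reads the MTRR clean-up trace from the boot log, discards the entries marked *BAD* or with negative "lose cover RAM", and prints the remaining entry that loses the least RAM as kernel parameters. The parameter names enable_mtrr_cleanup, mtrr_gran_size and mtrr_chunk_size come from the kernel's parameter documentation; the function names, the regular expression and the selection rule are only my assumptions. The sample text is the two entries quoted in this thread, which come from two different machines, so the result is illustrative and not a recommendation for the Dells.

```python
import re

UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}

# Assumed layout of one candidate entry in the MTRR clean-up trace.
CANDIDATE_RE = re.compile(
    r"(?P<bad>\*BAD\*)?\s*gran_size:\s*(?P<gran>\d+[KMG])\s+"
    r"chunk_size:\s*(?P<chunk>\d+[KMG])\s+"
    r"num_reg:\s*(?P<regs>\d+)\s+"
    r"lose cover RAM:\s*(?P<lose>-?\d+[KMG])"
)

# The two entries quoted in this thread (from two different machines).
SAMPLE = """\
Dec 06 21:18:47 HOSTNAME kernel: *BAD*gran_size: 128M chunk_size: 2G
num_reg: 10 lose cover RAM: -834M
[ 0.000000] gran_size: 64K chunk_size: 128M num_reg: 3
lose cover RAM: 0G
"""


def to_bytes(text):
    """Convert a size such as '128M' or '-834M' to bytes."""
    return int(text[:-1]) * UNITS[text[-1]]


def parse_candidates(log_text):
    # Collapse whitespace first so entries wrapped across lines still match.
    flat = " ".join(log_text.split())
    return [
        {
            "bad": m["bad"] is not None,
            "gran": m["gran"],
            "chunk": m["chunk"],
            "regs": int(m["regs"]),
            "lose": to_bytes(m["lose"]),
        }
        for m in CANDIDATE_RE.finditer(flat)
    ]


def best_candidate(candidates):
    """Pick the usable entry losing the least RAM, then using the fewest registers."""
    usable = [c for c in candidates if not c["bad"] and c["lose"] >= 0]
    return min(usable, key=lambda c: (c["lose"], c["regs"]), default=None)


def kernel_params(c):
    return f"enable_mtrr_cleanup mtrr_gran_size={c['gran']} mtrr_chunk_size={c['chunk']}"


if __name__ == "__main__":
    best = best_candidate(parse_candidates(SAMPLE))
    print(kernel_params(best) if best else "no usable candidate found")
```

    On Ubuntu the resulting string would normally be appended to GRUB_CMDLINE_LINUX in /etc/default/grub, followed by running update-grub and rebooting. To use it for real, feed it the full trace from the affected machine's own boot log rather than the sample.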
    --

    Fake news kills!

    I may be contacted via the contact address given on my website: www.macfh.co.uk

  • From Java Jive@java@evij.com.invalid to alt.os.linux,uk.comp.os.linux on Tue Dec 10 13:42:31 2024
    From Newsgroup: uk.comp.os.linux

    On 2024-12-08 12:00, Andy Burns wrote:

    Java Jive wrote:

    ERROR: Unable to locate IOAPIC for GSI 37

    Possible dodgy config tables in BIOS/UEFI.

    firmware upgrade?
    or legacy options you can disable?

    Yes, thanks, thinking that there might be something to any of your
    suggestions above, I've spent the last day or so trying to determine
    exactly which PCs were showing that fault. Originally, I'm fairly sure
    that there were more than just one, but now it seems to be just one,
    this one. I think the difference between the time of my OP and now is
    that in the meantime I've continued trying to clean up the boot
    messages, and this has involved uninstalling Virtualbox on all of the
    PCs, including this one. So ...

    My best guess for the other PCs is that Virtualbox was causing the messages.

    As for this PC, which is a Dell Precision M6800, something that I've
    noticed for the M6700/6800 series, which are of near identical design to
    each other, is that sometimes the COM port shows up under Windows as
    having problems and not installing properly. I'm not quite sure why
    this should be, but as there is no external serial connector anyway, and
    I haven't used an actual COM port [*] for around two decades either, I
    haven't bothered to investigate this phenomenon further.

    * I've used USB-to-low-voltage-serial cables, such as the Sony DKU-5
    cable that used to be used to connect their phones to a PC, to flash
    hardware such as routers, but not an actual COM port.

    Putting the above together with the fact that I've noticed that, on a
    Dell Inspiron, a daughterboard for connecting an NVMe only actually has
    the connector on those model variants that were supplied originally with
    an NVMe drive, other models have an otherwise identical daughterboard
    but without the actual connector - thus saving a few cents per PC
    sale! - I'm wondering if with these M6700/6800s Dell may have been
    doing something similar, populating the boards with some, but not all,
    of the hardware necessary for the COM port, this time saving on some of
    the actual chips required instead of just a connector.

    Either that or a PCB has a fault, but, being ATM in the middle of
    'churning' my hardware, I have three of these machines, and two of them
    show similar symptoms relating to the COM port. It seems rather
    improbable that 2 out of 3 machines bought pseudo-randomly from
    different eBay suppliers would have the same board fault on arrival.

    At any rate, my best guess for this PC is that the original message that
    I queried is being caused by the 'faulty' COM port.
    --

    Fake news kills!

    I may be contacted via the contact address given on my website: www.macfh.co.uk

  • From Paul@nospam@needed.invalid to alt.os.linux,uk.comp.os.linux on Tue Dec 10 09:40:30 2024
    From Newsgroup: uk.comp.os.linux

    On Tue, 12/10/2024 8:42 AM, Java Jive wrote:
    On 2024-12-08 12:00, Andy Burns wrote:

    Java Jive wrote:

    ERROR: Unable to locate IOAPIC for GSI 37

    Possible dodgy config tables in BIOS/UEFI.

    firmware upgrade?
    or legacy options you can disable?

    Yes, thanks, thinking that there might be something to any of your suggestions above, I've spent the last day or so trying to determine exactly which PCs were showing that fault. Originally, I'm fairly sure that there were more than just one, but now it seems to be just one, this one. I think the difference between the time of my OP and now is that in the meantime I've continued trying to clean up the boot messages, and this has involved uninstalling Virtualbox on all of the PCs, including this one. So ...

    My best guess for the other PCs is that Virtualbox was causing the messages.

    As for this PC, which is a Dell Precision M6800, something that I've noticed for the M6700/6800 series, which are of near identical design to each other, is that sometimes the COM port shows up under Windows as having problems and not installing properly. I'm not quite sure why this should be, but as there is no external serial connector anyway, and I haven't used an actual COM port [*] for around two decades either, I haven't bothered to investigate this phenomenon further.

    * I've used USB-to-low-voltage-serial cables, such as the Sony DKU-5 cable that used to be used to connect their phones to a PC, to flash hardware such as routers, but not an actual COM port.

    Putting the above together with the fact that I've noticed that, on a Dell Inspiron, a daughterboard for connecting an NVMe only actually has the connector on those model variants that were supplied originally with an NVMe drive, other models have an otherwise identical daughterboard but without the actual connector - thus saving a few cents per PC sale! - I'm wondering if with these M6700/6800s Dell may have been doing something similar, populating the boards with some, but not all, of the hardware necessary for the COM port, this time saving on some of the actual chips required instead of just a connector.

    Either that or a PCB has a fault, but, being ATM in the middle of 'churning' my hardware, I have three of these machines, and two of them show similar symptoms relating to the COM port. It seems rather improbable that 2 out of 3 machines bought pseudo-randomly from different eBay suppliers would have the same board fault on arrival.

    At any rate, my best guess for this PC is that the original message that I queried is being caused by the 'faulty' COM port.


    This is just a random suggestion, with no evidence to back it up.

    Power off the machine, remove one of the DIMMs and try your dmesg
    readout a second time, and see if your granularity issue changes.

    It could be that the address map on the chipset is defective somehow.

    The machine I'm typing on, has such a problem, and it used to freeze
    in the graphics driver, because the shared memory was somehow double defined
    or something. It's my suspicion that with less than max RAM installed,
    it would behave itself.

    *******

    The other idea I tried out here: I figured that on a machine with Intel
    Management Engine, the addressing may need to provide space for MINIX to
    run, and maybe the offset caused by that is the problem. But when I tested
    that theory on the Optiplex 780, dmesg was as clean as could be. It looked
    like PAT had been used. So that does not look like a credible possibility.

    Paul
  • From Theo@theom+news@chiark.greenend.org.uk to alt.os.linux,uk.comp.os.linux on Wed Dec 11 14:53:10 2024
    From Newsgroup: uk.comp.os.linux

    In uk.comp.os.linux Java Jive <java@evij.com.invalid> wrote:
    I'm going round my Ubuntu 22 machines trying to remove error and fail messages from the boot, mostly successfully, but five anomalies are
    proving hard to fix, despite which the PCs all seem to work ...

    1) IOAPIC

    This is occurring on more than one PC. Searching for ...

    ERROR: Unable to locate IOAPIC for GSI 37

    That means it can't work out which interrupt controller is used for a particular interrupt. Likely means the ACPI tables are incomplete. If everything works you can ignore this.

    2) blkmapd

    I think this is occurring on ALL my Ubuntu 22 machines. Appears to be related to NFS, but networking seems fine (apart from a minor seemingly unrelated issue already solved). Searching for ...

    "blkmapd[717]: open pipe file /run/rpc_pipefs/nfs/blocklayout
    failed: No such file or directory"

    Are you actually running NFS? If not you can ignore this.

    3) CUPS Scheduler

    This also is occurring on many or all of my Ubuntu 22 machines, even a
    while after a successful boot. Oddly the status of cups service always
    shows it to be working.

    Dec 07 17:06:00 HOSTNAME systemd[1]: cups.service: start operation timed out. Terminating.
    Dec 07 17:06:00 HOSTNAME systemd[1]: cups.service: Failed with result 'timeout'.
    Dec 07 17:06:00 HOSTNAME systemd[1]: Failed to start CUPS Scheduler.
    Dec 07 17:06:00 HOSTNAME systemd[1]: cups.service: Scheduled restart
    job, restart counter is at 5.
    Dec 07 17:06:00 HOSTNAME systemd[1]: Stopped CUPS Scheduler.
    Dec 07 17:06:00 HOSTNAME systemd[1]: cups.path: Deactivated successfully.
    Dec 07 17:06:00 HOSTNAME systemd[1]: Stopped CUPS Scheduler.
    Dec 07 17:06:00 HOSTNAME systemd[1]: Stopping CUPS Scheduler...
    Dec 07 17:06:00 HOSTNAME systemd[1]: Started CUPS Scheduler.
    Dec 07 17:06:00 HOSTNAME systemd[1]: cups.socket: Deactivated successfully.
    Dec 07 17:06:00 HOSTNAME systemd[1]: Closed CUPS Scheduler.
    Dec 07 17:06:00 HOSTNAME systemd[1]: Stopping CUPS Scheduler...
    Dec 07 17:06:00 HOSTNAME systemd[1]: Listening on CUPS Scheduler.
    Dec 07 17:06:00 HOSTNAME systemd[1]: Starting CUPS Scheduler...
    Dec 07 17:06:00 HOSTNAME audit[1760]: AVC apparmor="DENIED" operation="capable" profile="/usr/sbin/cupsd" pid=1760 comm="cupsd" capability=12 capname="net_>

    I'm not seeing a problem there? CUPS didn't start first time round because something was busy, but tried again and succeeded.

    4) UBSAN

    This is on a laptop with two on-board GPUs, and I think is related to
    that fact. However, there were no hits under DuckDuckGo, Google, or
    Yahoo for ...

    Ubuntu 22 "UBSAN: array-index-out-of-bounds in /build/linux-hwe-6.8-W0MdK2/linux-hwe-6.8-6.8.0/drivers/gpu/drm/radeon/radeon_atombios.c:633:33"

    UBSAN is the Undefined Behaviour Sanitiser, ie a debugging tool. Something went wrong in the driver for AMD Radeon GPUs, ie a bug, maybe due to the relatively elderly GPU you have. If you aren't using the AMD GPU you can ignore it if it's not actually causing a crash (or could disable the driver
    if you wanted).

    5) Initiating RAM registers

    The following is occurring very early in the logs on 2 machines with
    32GB RAM, and seems to be about how to set up registers for RAM access, as
    per these two links ...

    This seems to be related to an MTRR problem - maybe the hardware doesn't let the kernel find the optimal memory layout with more RAM than it was
    originally designed for. How much RAM does Linux show you have after it's booted? Does it lose any memory, and can you live with having the amount
    that remains?


    Most of these seem like 'new Linux, old hardware' issues, but nothing
    actually to worry me there. I'd check you're on the latest BIOS as that
    might help some of the ACPI related issues.
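
    To picture what the IOAPIC message in item 1 means, here is a much simplified sketch of the lookup involved. Each interrupt controller covers a range of global system interrupt (GSI) numbers, and the error corresponds to asking about a GSI that no listed controller covers. This is not the kernel's code; the class, the function and the single 24-input controller in the table are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IoApic:
    ioapic_id: int
    gsi_base: int  # first GSI handled by this controller
    pins: int      # number of interrupt inputs

    def handles(self, gsi: int) -> bool:
        return self.gsi_base <= gsi < self.gsi_base + self.pins


def find_ioapic(ioapics: List[IoApic], gsi: int) -> Optional[IoApic]:
    """Return the controller whose GSI range contains gsi, or None."""
    for apic in ioapics:
        if apic.handles(gsi):
            return apic
    return None


# Hypothetical table: one controller with 24 inputs, covering GSIs 0-23.
TABLE = [IoApic(ioapic_id=0, gsi_base=0, pins=24)]

if __name__ == "__main__":
    for gsi in (9, 37):
        apic = find_ioapic(TABLE, gsi)
        if apic is None:
            print(f"ERROR: Unable to locate IOAPIC for GSI {gsi}")
        else:
            print(f"GSI {gsi} -> IOAPIC {apic.ioapic_id}, pin {gsi - apic.gsi_base}")
```

    With a table like that, anything in the firmware that refers to GSI 37 has nowhere to land, which is why incomplete or inconsistent ACPI tables are the usual suspect.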

    Theo
  • From Java Jive@java@evij.com.invalid to alt.os.linux,uk.comp.os.linux on Thu Dec 12 12:11:05 2024
    From Newsgroup: uk.comp.os.linux

    On 2024-12-11 14:53, Theo wrote:

    In uk.comp.os.linux Java Jive <java@evij.com.invalid> wrote:

    I'm going round my Ubuntu 22 machines trying to remove error and fail
    messages from the boot, mostly successfully, but five anomalies are
    proving hard to fix, despite which the PCs all seem to work ...

    1) IOAPIC

    This is occurring on more than one PC. Searching for ...

    ERROR: Unable to locate IOAPIC for GSI 37

    That means it can't work out which interrupt controller is used for a particular interrupt. Likely means the ACPI tables are incomplete. If everything works you can ignore this.

    Everything of any importance seems to be working, but see also my reply
    to Andy regarding the COM port.

    2) blkmapd

    I think this is occurring on ALL my Ubuntu 22 machines. Appears to be
    related to NFS, but networking seems fine (apart from a minor seemingly
    unrelated issue already solved). Searching for ...

    "blkmapd[717]: open pipe file /run/rpc_pipefs/nfs/blocklayout
    failed: No such file or directory"

    Are you actually running NFS? If not you can ignore this.

    Yes, and it *seems* to be running fine, so I'm not sure what is going on
    here.

    3) CUPS Scheduler

    This also is occurring on many or all of my Ubuntu 22 machines, even a
    while after a successful boot. Oddly the status of cups service always
    shows it to be working.

    Dec 07 17:06:00 HOSTNAME systemd[1]: cups.service: start operation timed
    out. Terminating.
    Dec 07 17:06:00 HOSTNAME systemd[1]: cups.service: Failed with result
    'timeout'.
    Dec 07 17:06:00 HOSTNAME systemd[1]: Failed to start CUPS Scheduler.
    Dec 07 17:06:00 HOSTNAME systemd[1]: cups.service: Scheduled restart
    job, restart counter is at 5.
    Dec 07 17:06:00 HOSTNAME systemd[1]: Stopped CUPS Scheduler.
    Dec 07 17:06:00 HOSTNAME systemd[1]: cups.path: Deactivated successfully.
    Dec 07 17:06:00 HOSTNAME systemd[1]: Stopped CUPS Scheduler.
    Dec 07 17:06:00 HOSTNAME systemd[1]: Stopping CUPS Scheduler...
    Dec 07 17:06:00 HOSTNAME systemd[1]: Started CUPS Scheduler.
    Dec 07 17:06:00 HOSTNAME systemd[1]: cups.socket: Deactivated successfully.
    Dec 07 17:06:00 HOSTNAME systemd[1]: Closed CUPS Scheduler.
    Dec 07 17:06:00 HOSTNAME systemd[1]: Stopping CUPS Scheduler...
    Dec 07 17:06:00 HOSTNAME systemd[1]: Listening on CUPS Scheduler.
    Dec 07 17:06:00 HOSTNAME systemd[1]: Starting CUPS Scheduler...
    Dec 07 17:06:00 HOSTNAME audit[1760]: AVC apparmor="DENIED"
    operation="capable" profile="/usr/sbin/cupsd" pid=1760 comm="cupsd"
    capability=12 capname="net_>

    I'm not seeing a problem there? CUPS didn't start first time round because something was busy, but tried again and succeeded.

    I see, but then presumably it's an error by the folk who wrote the code
    to flag it as an error.

    4) UBSAN

    This is on a laptop with two on-board GPUs, and I think is related to
    that fact. However, there were no hits under DuckDuckGo, Google, or
    Yahoo for ...

    Ubuntu 22 "UBSAN: array-index-out-of-bounds in
    /build/linux-hwe-6.8-W0MdK2/linux-hwe-6.8-6.8.0/drivers/gpu/drm/radeon/radeon_atombios.c:633:33"

    UBSAN is the Undefined Behaviour Sanitiser, ie a debugging tool. Something went wrong in the driver for AMD Radeon GPUs, ie a bug, maybe due to the relatively elderly GPU you have. If you aren't using the AMD GPU you can ignore it if it's not actually causing a crash (or could disable the driver if you wanted).

    I understand. I'm not aware of any problems with graphics under Ubuntu.

    5) Initiating RAM registers

    The following is occurring very early in the logs on 2 machines with
    32GB RAM, and seems to be about how to set up registers for RAM access, as
    per these two links ...

    This seems to be related to an MTRR problem - maybe the hardware doesn't let the kernel find the optimal memory layout with more RAM than it was originally designed for. How much RAM does Linux show you have after it's booted? Does it lose any memory, and can you live with having the amount that remains?

    lsmem gives ...

    root@HOSTNAME:home# lsmem
    RANGE                                  SIZE  STATE REMOVABLE   BLOCK
    0x0000000000000000-0x00000000bfffffff    3G online       yes    0-23
    0x0000000100000000-0x000000083fffffff   29G online       yes  32-263

    Memory block size: 128M
    Total online memory: 32G
    Total offline memory: 0B

    ... so Ubuntu seems to be able to access all the actual physical RAM.
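
    As a cross-check on that conclusion, the address ranges can be summed directly instead of trusting the SIZE column. A small sketch, using the two range lines from the lsmem output above as sample text; the function name and regular expression are my own.

```python
import re

# The two range lines from the lsmem output above.
SAMPLE = """\
0x0000000000000000-0x00000000bfffffff 3G online yes 0-23
0x0000000100000000-0x000000083fffffff 29G online yes 32-263
"""

# start-end addresses, then the SIZE column (ignored), then the STATE column.
RANGE_RE = re.compile(r"0x([0-9a-fA-F]+)-0x([0-9a-fA-F]+)\s+\S+\s+(\S+)")


def online_bytes(text):
    """Sum the sizes of all ranges whose state is 'online'."""
    total = 0
    for start, end, state in RANGE_RE.findall(text):
        if state == "online":
            # Ranges are inclusive, hence the +1.
            total += int(end, 16) - int(start, 16) + 1
    return total


if __name__ == "__main__":
    print(f"{online_bytes(SAMPLE) / 1024**3:g} GiB online")
```

    The ranges add up to exactly 32 GiB. The gap between 3G and 4G is presumably the usual hole reserved for device address space, with the RAM that would have sat there remapped above 4G.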

    Most of these seem like 'new Linux, old hardware' issues, but nothing actually to worry me there. I'd check you're on the latest BIOS as that might help some of the ACPI related issues.

    I see. Thanks very much for your help, Theo, much appreciated.
    --

    Fake news kills!

    I may be contacted via the contact address given on my website: www.macfh.co.uk
