• Bug#1105826: cloud.debian.org: Request to Adjust NVMe Timeout Defaults

    From Noah Meyerhans@21:1/5 to All on Thu May 15 18:40:01 2025
    Package: cloud.debian.org
    Severity: normal
    User: cloud.debian.org@packages.debian.org

    Microsoft would like us to adjust the default NVMe timeout settings on our bookworm images to improve reliability on Azure.

    Azure VMs use NVMe for ephemeral storage, and newer VM sizes use it for their root volumes. Microsoft has received reports of Linux systems using the
    kernel default settings failing under certain circumstances with dmesg containing messages such as those shown below.

    Microsoft recommends a timeout of 240 seconds; see https://github.com/Azure/SAP-on-Azure-Scripts-and-Utilities/blob/432d8b3ccd1061aeb95552afc645f5390f1449d1/NVMe-Preflight-Check/azure-nvme-preflight-check.sh#L121-L161
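
    As a rough illustration of what such a check involves, the sketch below reads the running kernel's NVMe I/O timeout and compares it with the 240-second recommendation. It assumes the value is exposed in seconds at /sys/module/nvme_core/parameters/io_timeout and that a boot-time override would be spelled nvme_core.io_timeout=240; the function names are made up for this example, and none of this is taken from azure-vm-utils or the Debian image build.

```python
from pathlib import Path

# Assumption: nvme_core exposes its I/O timeout, in seconds, at this sysfs path.
IO_TIMEOUT_PARAM = Path("/sys/module/nvme_core/parameters/io_timeout")
RECOMMENDED_SECONDS = 240  # value from the Microsoft recommendation above


def read_io_timeout(param=IO_TIMEOUT_PARAM):
    """Return the current timeout in seconds, or None if it cannot be read."""
    try:
        return int(Path(param).read_text().strip())
    except (OSError, ValueError):
        return None


def meets_recommendation(current, recommended=RECOMMENDED_SECONDS):
    """True when a timeout was read and is at least the recommended value."""
    return current is not None and current >= recommended


if __name__ == "__main__":
    current = read_io_timeout()
    if current is None:
        print("nvme_core io_timeout is not readable (module not loaded?)")
    elif meets_recommendation(current):
        print(f"io_timeout={current}s meets the {RECOMMENDED_SECONDS}s recommendation")
    else:
        # Assumed kernel command-line spelling for a persistent override.
        print(f"io_timeout={current}s is below the recommendation; "
              f"consider nvme_core.io_timeout={RECOMMENDED_SECONDS}")
```

    This only reports the current state. How the setting should actually be shipped in the images is the open question of this bug.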

    Additional details on NVMe in Azure are at https://learn.microsoft.com/en-us/azure/virtual-machines/nvme-linux

    I thought azure-vm-utils was already doing this, but apparently it's not. I've requested that feature upstream at https://github.com/Azure/azure-vm-utils/issues/80. If it gets implemented upstream in the near term, we should be able to get the change into trixie. However, since that package isn't present in bookworm, we'd need to come up with another approach there.

    dmesg symptoms:
    [169365.182748] nvme nvme0: I/O tag 246 (60f6) opcode 0x2 (Read) QID 21 timeout, aborting req_op:READ(0) size:262144
    [169365.183193] nvme nvme0: Abort status: 0x0
    [169365.183506] nvme nvme0: I/O tag 249 (80f9) opcode 0x2 (Read) QID 21 timeout, aborting req_op:READ(0) size:262144
    [169365.183880] nvme nvme0: Abort status: 0x0
    [169365.184197] nvme nvme0: I/O tag 250 (e0fa) opcode 0x2 (Read) QID 21 timeout, aborting req_op:READ(0) size:262144
    [169365.184564] nvme nvme0: Abort status: 0x0
    [169365.184893] nvme nvme0: I/O tag 251 (d0fb) opcode 0x2 (Read) QID 21 timeout, aborting req_op:READ(0) size:262144
    [169365.185313] nvme nvme0: Abort status: 0x0
    [169365.185627] nvme nvme0: I/O tag 252 (f0fc) opcode 0x2 (Read) QID 21 timeout, aborting req_op:READ(0) size:262144
    [169365.186019] nvme nvme0: Abort status: 0x0
    [169365.186335] nvme nvme0: I/O tag 253 (90fd) opcode 0x2 (Read) QID 21 timeout, aborting req_op:READ(0) size:69632
    [169365.186697] nvme nvme0: Abort status: 0x0
    [169365.497993] nvme nvme0: I/O tag 164 (e0a4) opcode 0x2 (Read) QID 9 timeout, reset controller
    [169368.888085] nvme_log_error: 108 callbacks suppressed
    [169368.888551] nvme0n9: Read(0x2) @ LBA 1179738368, 64 blocks, Host Aborted Command (sct 0x3 / sc 0x71)
    [169368.888995] I/O error, dev nvme0n9, sector 9437906944 op 0x0:(READ) flags 0x84700 phys_seg 64 prio class 2
    [169368.889723] nvme0n9: Read(0x2) @ LBA 1179738432, 64 blocks, Host Aborted Command (sct 0x3 / sc 0x71)
    [169368.890119] I/O error, dev nvme0n9, sector 9437907456 op 0x0:(READ) flags 0x84700 phys_seg 64 prio class 2
    [169368.890439] nvme0n9: Read(0x2) @ LBA 1179738496, 30 blocks, Host Aborted Command (sct 0x3 / sc 0x71)
    [169368.890757] I/O error, dev nvme0n9, sector 9437907968 op 0x0:(READ) flags 0x84700 phys_seg 30 prio class 2
    [169368.891124] nvme0n9: Read(0x2) @ LBA 1179738526, 64 blocks, Host Aborted Command (sct 0x3 / sc 0x71)
    [169368.891444] I/O error, dev nvme0n9, sector 9437908208 op 0x0:(READ) flags 0x84700 phys_seg 64 prio class 2
    [169368.891764] nvme0n9: Read(0x2) @ LBA 1179738590, 64 blocks, Host Aborted Command (sct 0x3 / sc 0x71)
    [169368.892124] I/O error, dev nvme0n9, sector 9437908720 op 0x0:(READ) flags 0x84700 phys_seg 64 prio class 2
    [169368.892443] nvme0n9: Read(0x2) @ LBA 1179738654, 64 blocks, Host Aborted Command (sct 0x3 / sc 0x71)
    [169368.892759] I/O error, dev nvme0n9, sector 9437909232 op 0x0:(READ) flags 0x84700 phys_seg 64 prio class 2
    [169368.893118] nvme0n9: Read(0x2) @ LBA 1179738718, 64 blocks, Host Aborted Command (sct 0x3 / sc 0x71)
    [169368.893438] I/O error, dev nvme0n9, sector 9437909744 op 0x0:(READ) flags 0x84700 phys_seg 64 prio class 2
    [169368.893766] nvme0n9: Read(0x2) @ LBA 1179738782, 64 blocks, Host Aborted Command (sct 0x3 / sc 0x71)
    [169368.894126] I/O error, dev nvme0n9, sector 9437910256 op 0x0:(READ) flags 0x84700 phys_seg 64 prio class 2
    [169368.894448] nvme0n9: Read(0x2) @ LBA 1179738846, 64 blocks, Host Aborted Command (sct 0x3 / sc 0x71)
    [169368.894768] I/O error, dev nvme0n9, sector 9437910768 op 0x0:(READ) flags 0x84700 phys_seg 64 prio class 2
    [169368.895126] nvme0n9: Read(0x2) @ LBA 1179738910, 64 blocks, Host Aborted Command (sct 0x3 / sc 0x71)
    [169368.895448] I/O error, dev nvme0n9, sector 9437911280 op 0x0:(READ) flags 0x84700 phys_seg 64 prio class 2
    [169369.255892] nvme nvme0: 48/0/0 default/read/poll queues
    [169399.478284] nvme nvme0: I/O tag 43 (a02b) QID 15 timeout, disable controller
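
    For anyone triaging similar reports, the timeout lines in the log above can be tallied mechanically. The following is a minimal sketch: the pattern covers only the three message forms seen in this report (abort, controller reset, controller disable), the wording may differ on other kernel versions, and the function name is hypothetical.

```python
import re
from collections import Counter

# Matches only the nvme timeout message forms that appear in this report.
TIMEOUT_RE = re.compile(
    r"nvme (?P<ctrl>nvme\d+): I/O tag \d+ .*QID (?P<qid>\d+) timeout, "
    r"(?P<action>aborting|reset controller|disable controller)"
)


def summarize_timeouts(log_text):
    """Count timeout events per (controller, action) in dmesg-style text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            counts[(match["ctrl"], match["action"])] += 1
    return counts


# Three lines copied from the log in this report, one per message form.
SAMPLE = """\
[169365.182748] nvme nvme0: I/O tag 246 (60f6) opcode 0x2 (Read) QID 21 timeout, aborting req_op:READ(0) size:262144
[169365.497993] nvme nvme0: I/O tag 164 (e0a4) opcode 0x2 (Read) QID 9 timeout, reset controller
[169399.478284] nvme nvme0: I/O tag 43 (a02b) QID 15 timeout, disable controller
"""

if __name__ == "__main__":
    for (ctrl, action), n in sorted(summarize_timeouts(SAMPLE).items()):
        print(f"{ctrl}: {action} x{n}")
```

    An escalation from aborts to a controller reset and then a disable, as in this log, is the pattern of interest.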
