• Re: A 2025 New Year present: make dpkg --force-unsafe-io the default?

    From Guillem Jover@21:1/5 to Michael Tokarev on Sat Dec 28 17:10:01 2024
    Hi!

[ This was long ago, and the following is from recollection off the
top of my head and some mild «git log» crawling, and while I think
it's still an accurate description of past events, interested people
can probably sift through the various long discussions at the time
in bug reports and mailing lists from references in the FAQ entry,
which BTW I don't think has been touched since, so might additionally
be in need of a refresh perhaps, don't know. ]

    On Tue, 2024-12-24 at 12:54:28 +0300, Michael Tokarev wrote:
    The no-unsafe-io workaround in dpkg was needed for 2005-era ext2fs
    issues,

    The problem showed up with ext4 (not ext2 or ext3), AFAIR when Ubuntu
    switched their default filesystem in their installer, and reports
    started to come in droves about systems being broken.

    For all of its existence (AFAIR) dpkg has performed safe and durable
    operations for its own database (not for the database directories),
but it was not doing the same for the installed filesystem. The fsync()s
for the installed files were introduced at the time to fix the zero-length file behavior from newer filesystems.

where a power-cut in the middle of a filesystem metadata
operation (which dpkg does a lot) might result in an inconsistent
    filesystem state. This workaround slowed down dpkg operations
    quite significantly (and has been criticised due to that a lot,
    the difference is really significant).

    I do think the potential for the zero-length files behavior is a
    misfeature of newer filesystems, but I do agree that the fsync()s
    are the only way to guarantee the properties dpkg expects from the
    filesystem. So I don't consider that a workaround at all.

    My main objection was/is with how upstream Linux filesystem
maintainers characterized all this. It looked like they were
disparaging userland application writers in general as being
incompetent for not performing such fsync()s, but then when one adds
    them, those programs become extremely slow, and then one would need
    to start using filesystem or OS specific APIs and rather unnatural
    code patterns to regain some semblance of the previous performance.
    I don't think this upstream perspective has changed much, given that
    the derogatory O_PONIES subject still comes up from time to time.

    The workaround is to issue fsync() after almost every filesystem
    operation, instead of after each transaction as dpkg did before.

    Once again: dpkg has always been doing "safe io", the workaround
    was needed for ext2fs only, - it was the filesystem which was
    broken, not dpkg.

    The above also seems quite confused. :) dpkg has always done fsync()
    for both its status file and for every in-core status modification
    via its own journaled support for it (in the /var/lib/dpkg/updates/
    directory).

    What was implemented at the time was to add missing fsync()s for
    database directories, and fsync()s for the unpacked filesystem objects.

    AFAIR:

    * We first implemented that via fsync()s to individual files
    immediately after writing them on unpack, which had acceptable
    performance on filesystems such as ext3 (which I do recall using
    at the time) but was pretty terrible on ext4.
    * Then we reworked the code to defer and batch all the fsync()s for
    a specific package after all the file writes, and before the renames,
    which was a bit better but not great.
* Then after a while we tried to use a single sync(2) before the
package file renames, which implied system-wide syncs and thus
terrible performance for unrelated filesystems (such as USB
drives or network mounts), so it got subsequently disabled.
* Then --force-unsafe-io was added to cope with workloads where the
safety was not required, or for people who preferred performance
over safety, on those same new filesystems that required it, where
the option meant performance xor safety.
* Then, after suggestions from Linux filesystem developers we switched
to initiating asynchronous writebacks immediately after a file unpack
so as not to block (via Linux sync_file_range(SYNC_FILE_RANGE_WRITE)),
and then adding a writeback barrier where the previous (disabled)
sync(2) was (via Linux sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE)),
so that by the time of the subsequent fsync(2) the writeback would
have already been done, and the fsync(2) would only imply a
synchronization point (sketched below).
* Then for non-Linux, a posix_fadvise(POSIX_FADV_DONTNEED) call was
added instead of the SYNC_FILE_RANGE_WRITE.
    * Then after a bit the disabled sync(2) code got removed.
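
To illustrate that writeback pattern, here is a minimal C sketch
(illustrative only, with a made-up file name; not dpkg's actual code)
of starting writeback per unpacked file with SYNC_FILE_RANGE_WRITE,
adding a writeback barrier with SYNC_FILE_RANGE_WAIT_BEFORE before the
renames, and only then calling fsync(2):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Right after writing one unpacked file: start asynchronous writeback
 * without blocking the unpack loop. */
static void start_writeback(int fd)
{
    /* nbytes == 0 means "from offset to the end of the file" */
    sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
}

/* Just before the batch of renames: wait for the writeback started
 * earlier, so the following fsync(2) is mostly a synchronization point
 * rather than a full flush. */
static int finish_writeback(int fd)
{
    if (sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WAIT_BEFORE) < 0)
        return -1;
    return fsync(fd);
}

int main(void)
{
    int fd = open("/tmp/unpack-demo", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return 1;
    if (write(fd, "payload\n", 8) != 8)
        return 1;
    start_writeback(fd);
    /* ... unpack further files here, then, just before renaming: */
    if (finish_writeback(fd) < 0)
        return 1;
    close(fd);
    return 0;
}

On non-Linux systems, as noted above, posix_fadvise(POSIX_FADV_DONTNEED)
is used instead of the SYNC_FILE_RANGE_WRITE step.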

    Today, doing an fsync() really hurts, - with SSDs/flash it reduces
    the lifetime of the storage, for many modern filesystems it is a
    costly operation which bloats the metadata tree significantly,
resulting in all further operations becoming inefficient.

    How about turning this option - force-unsafe-io - to on by default
    in 2025? That would be a great present for 2025 New Year! :)

    Given that the mail is based on multiple incorrect premises, :) and
    that I don't see any tests or data backing up that the fsync()s are
    no longer needed for safety in general, I'm going to be extremely
    reluctant to even consider disabling them by default on the main
    system installation, TBH, and would ask for substantial proof that
    this would not damage user systems, and even then I'd probably still
    feel rather uneasy about it.

    And in fact, AFAIR dpkg is still missing fsync()s for filesystem
    directories, which I think might have been the cause of reported
leftover files (shared libraries specifically) that never got removed
    and then caused problems. Still need to prep a testing rig for this
    and try to reproduce that with the WIP branch I've got around.


    OTOH what I also have queued is to add a new --force-reckless-io, to
    suppress all fsync()s (including the ones for the database), which
    would be ideal to be used on installers, chroots or containers (or for
    people who prefer performance over safety, or have lots of backups and
    are aware of the trade-offs :). But that has been kind of blocked on
    adding database tainting support, because the filesystem contents can
    always be checked via digests or can be reinstalled, but if your
    database is messed up it's rather hard to know that. The problem is
    that because installers would want to use that option, we'd end up
    with tainted end systems which would be wrong. Well, or the taint would
need to be manually removed (requiring external programs to reach
    for the dpkg database). But the above and --force-unsafe-io _could_ be
    easily enabled by default in chroot mode (--root) w/o tainting anything
    (I've also got some code to make that possible). And I've got on my
    TODO to add integrity tracking for the database so that damage can be
    more easily detected, which could perhaps make the tainting less of an
    issue.

    (So I'm sorry, but it looks like you'll not get your 2025 present. :)

    Regards,
    Guillem

  • From David Kalnischkies@21:1/5 to All on Sat Dec 28 16:40:01 2024
On Sat, Dec 28, 2024 at 10:42:18AM +0100, Marc Haber wrote:
    On Sat, 28 Dec 2024 00:13:02 +0100, Aurélien COUDERC
    <coucouf@debian.org> wrote:
Totally agreed: yes it would be extremely useful to have some snapshotting feature for apt operations, and no we're never going to get there if we wait for every single filesystem on every kernel to implement it. So if this has to start with btrfs
then… great news and super cool!

Do we have data about how many of our installations would be eligible
    to profit from this new invention? I might think it would be better to

    fwiw this "new invention" isn't one at all.
    Julian was talking about the more than a decade old https://launchpad.net/apt-btrfs-snapshot

    But yeah, most of the concerns Guillem has for dpkg apply to apt also,
    as it would be kinda sad if a failed unattended upgrade in the
background reset your DebConf presentation slides to a previous
snapshot (aka: empty), so that kinda requires a particular setup
and configuration. Not something you can silently roll out to the
    masses as the new and only default in Debian zurg.


    spend time on features that all users benefit from (in the case of

    Yeah… no. I am hard pressed to name a single feature that benefits "all users". You might mean "that benefits folks similar to me" given your
    example is conf files but that isn't even close to "all".

I would even suspect most apt runs are under the influence of DEBCONF_FRONTEND=noninteractive and some --force-conf* given its
    prevalence in containers and upgrade infrastructure and so your "all"
    might not even mean "the majority" aka a minority use case…

    But don't worry, with some luck we might even work on your fringe use
    cases some day. Sooner if you help, of course.


    Fun fact: apt has code specifically for a single filesystem already:
    JFFS2. The code makes it so that you can run apt on systems that used
    that filesystem out of the box like the OpenMoko Freerunner.
    (Okay, the code is not specific for that filesystem, it is just the only
    known filesystem that lacks the feature apt uses otherwise: mapping
    a file – its binary cache – into shared memory, see mmap(2)).
    And yet, somehow, more than a decade later, people still use apt on other filesystems (I kinda suspect "only" nowadays actually).
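
For the curious, here is a rough C sketch of that mmap(2) technique
(illustrative only, not apt's actual code; the path is just apt's usual
binary cache location):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    /* apt's usual binary cache file, used here only for illustration */
    int fd = open("/var/cache/apt/pkgcache.bin", O_RDONLY);
    if (fd < 0)
        return 1;

    struct stat st;
    if (fstat(fd, &st) != 0)
        return 1;

    /* map the whole cache into memory instead of read()ing it */
    void *cache = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (cache == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    printf("mapped %lld bytes of cache\n", (long long)st.st_size);
    munmap(cache, st.st_size);
    close(fd);
    return 0;
}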


    Best regards

    David Kalnischkies


  • From Gioele Barabucci@21:1/5 to Michael Tokarev on Sat Dec 28 20:30:01 2024
    On 24/12/24 10:54, Michael Tokarev wrote:
    Today, doing an fsync() really hurts, - with SSDs/flash it reduces
    the lifetime of the storage, for many modern filesystems it is a
    costly operation which bloats the metadata tree significantly,
resulting in all further operations becoming inefficient.

    How about turning this option - force-unsafe-io - to on by default
    in 2025?  That would be a great present for 2025 New Year! :)

There is a possible related, but independent, optimization that has the
chance to significantly reduce dpkg's install time by up to 90%.

There is a PoC patch [1,2] to teach dpkg to reflink files from data.tar
    instead of copying them. With no changes in semantics or FS operations,
    the time to install big packages like linux-firmware goes down from 39
    seconds to 6 seconds. The use of reflink would have no adverse
    consequences for users of ext4, but it would greatly speed up package installation on XFS, btrfs and (in some cases) ZFS.

    [1] https://bugs.debian.org/1086976
    [2] https://github.com/teknoraver/dpkg/compare/main...cow
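
For readers unfamiliar with reflinks, here is an illustrative C sketch
(not the patch from [1,2]; file names, offsets and sizes are made up)
of cloning a byte range out of an uncompressed data.tar with the Linux
FICLONERANGE ioctl, which is one way reflinking is exposed on XFS and
btrfs; offsets must be filesystem-block aligned, and where reflinks are
unsupported the call fails and a regular copy is needed instead:

#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    int src = open("data.tar", O_RDONLY);          /* hypothetical input */
    int dst = open("unpacked-file", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (src < 0 || dst < 0)
        return 1;

    struct file_clone_range fcr = {
        .src_fd      = src,
        .src_offset  = 4096,   /* made-up offset of the member's data */
        .src_length  = 8192,   /* made-up member size, block aligned */
        .dest_offset = 0,
    };
    if (ioctl(dst, FICLONERANGE, &fcr) != 0) {
        perror("FICLONERANGE");   /* e.g. not supported on ext4 */
        return 1;
    }

    close(src);
    close(dst);
    return 0;
}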

    Regards,

    --
    Gioele Barabucci

  • From Guillem Jover@21:1/5 to Gioele Barabucci on Sun Dec 29 00:00:01 2024
    Hi!

    On Sat, 2024-12-28 at 20:28:30 +0100, Gioele Barabucci wrote:
There is a possible related, but independent, optimization that has the chance to significantly reduce dpkg's install time by up to 90%.

There is a PoC patch [1,2] to teach dpkg to reflink files from data.tar
    instead of copying them. With no changes in semantics or FS operations, the time to install big packages like linux-firmware goes down from 39 seconds
    to 6 seconds. The use of reflink would have no adverse consequences for
    users of ext4, but it would greatly speed up package installation on XFS, btrfs and (in some cases) ZFS.

    I've not closed that bug report yet, because I've been meaning to
    ponder whether there is something from the proposal there that could
    be used to build upon. And whether supporting that special case makes
    sense at all.

Unfortunately as it stands, that proposal requires .debs to have
been fsync()ed beforehand (by the frontend or the user or something),
requires the data.tar to not be compressed at all, and introduces a
layer violation which I think makes the .deb handling less robust, as
it would make the tar parser trip over appended ar members after the
data.tar, for example.

    Part of the trick here is that the fsync()s are skipped, but I think
    even if none of the above were problems, then we'd still need to
    fsync() stuff to get the actual filesystem entries to make sense, so
    the currently missing directory fsync()s might be a worse problem for
    such reflinking than the proposed disabled file data fsync()s in the
    patch. But I've not checked how reflinking interacts in general with
    fsync()s, etc.

    Thanks,
    Guillem

  • From Theodore Ts'o@21:1/5 to Michael Stone on Sun Dec 29 02:30:01 2024
    On Thu, Dec 26, 2024 at 01:19:34PM -0500, Michael Stone wrote:
    Further reading: look at the auto_da_alloc option in ext4. Note that it says that doing the rename without the sync is wrong, but there's now a heuristic in ext4 that tries to insert an implicit sync when that anti-pattern is used (because so much data got eaten when people did the wrong thing). By leaning on that band-aid dpkg might get away with skipping the sync, but doing so would require assuming a filesystem for which that implicit guarantee is available. If you're on a different filesystem or a different kernel all
    bets would be off. I don't know how much difference skipping the fsync's makes these days if they get done implicitly.

    Note that it's not a sync, but rather, under certain circumstances, we
    initiate writeback --- but we don't wait for it to complete before
    allowing the close(2) or rename(2) to complete. For close(2), we will
    initiate a writeback on a close if the file descriptor was opened
    using O_TRUNC and truncate took place to throw away the previous
    contents of the file. For rename(2), if you rename on top of a
    previously existing file, we will initiate the writeback right away.
    This was a tradeoff between safety and performance, and this was done
    because there was an awful lot of buggy applications out there which
    didn't use fsync, and the number of application programmers greatly
outnumbered the file system programmers. This was a compromise that
    was discussed at a Linux Storage, File Systems, and Memory Management
    (LSF/MM) conference many years ago, and I think other file systems
    like btrfs and xfs had agreed in principle that this was a good thing
    to do --- but I can't speak to whether they actually implemented it.

It's very likely though that file systems that didn't exist in that
time frame, or that were written by programmers who care a lot more
about absolute performance than, say, usability in real world
circumstances, wouldn't have implemented this workaround. So both the
fact that it's not perfect (it narrows the window of vulnerability
from 30 seconds to a fraction of a second, but it's certainly not
perfect) and the fact that not all file systems will implement this
(I'd be shocked if bcachefs had this feature) are good reasons not to
depend on it. Of course, if you use crappy applications, you as a user may very
    well be depending on it without knowing it --- which is *why*
    auto_da_alloc exists. :-)

That being said, there are things you could do to speed up dpkg which
are 100% safe; the trade-off, as always, is implementation
complexity. (The reason why many application programs opened with
O_TRUNC and rewrote a file was so they wouldn't have to copy over
extended attributes and POSIX ACLs, because that was Too Hard and Too
Complicated.) So what dpkg could do is, whenever there is a file
that dpkg would need to overwrite, to write it out to
"filename.dpkg-new-$pid" and keep a list of all the files. After all
    of the files are written out, call syncfs(2) --- on Linux, syncfs(2)
    is synchronous, although POSIX does not guarantee that the writes will
    be written and stable at the time that syncfs(2) returns. But that
    should be OK, since Debian GNU/kFreeBSD is no longer a thing. Only
    after syncfs(2) returns, do you rename all of the dpkg-new files to
    the final location on disk.

    This is much faster, since you're not calling fsync(2) for each file,
    but only forcing a file system commit operation just once. The cost
    is more implementation complexity in dpkg. I'll let other people
decide how to trade off implementation complexity, performance, and
    safety.
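
A minimal C sketch of that write-everything, syncfs(2) once, then
rename-everything sequence (paths, contents and error handling are
simplified for illustration; this is not a proposed dpkg patch):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

static void write_tmp(const char *final, pid_t pid)
{
    char tmp[4096];
    snprintf(tmp, sizeof tmp, "%s.dpkg-new-%ld", final, (long)pid);
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0 || write(fd, "example contents\n", 17) != 17)
        exit(1);
    close(fd);              /* note: no per-file fsync() */
}

int main(void)
{
    const char *files[] = { "/tmp/demo/a.conf", "/tmp/demo/b.conf" };
    pid_t pid = getpid();

    mkdir("/tmp/demo", 0755);

    /* 1. Write every file under its .dpkg-new-$pid temporary name. */
    for (int i = 0; i < 2; i++)
        write_tmp(files[i], pid);

    /* 2. One syncfs() on the target filesystem flushes them all;
     *    on Linux it waits for the writeback to complete. */
    int dirfd = open("/tmp/demo", O_RDONLY | O_DIRECTORY);
    if (dirfd < 0 || syncfs(dirfd) != 0)
        exit(1);
    close(dirfd);

    /* 3. Only now rename the temporaries over the final names. */
    for (int i = 0; i < 2; i++) {
        char tmp[4096];
        snprintf(tmp, sizeof tmp, "%s.dpkg-new-%ld", files[i], (long)pid);
        if (rename(tmp, files[i]) != 0)
            exit(1);
    }
    return 0;
}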

    Cheers,

    - Ted

  • From Florian Weimer@21:1/5 to All on Sun Jan 5 10:00:01 2025
    * Theodore Ts'o:

    On Thu, Dec 26, 2024 at 01:19:34PM -0500, Michael Stone wrote:
Further reading: look at the auto_da_alloc option in ext4. Note that it says that doing the rename without the sync is wrong, but there's now a heuristic in ext4 that tries to insert an implicit sync when that anti-pattern is used (because so much data got eaten when people did the wrong thing). By leaning on that band-aid dpkg might get away with skipping the sync, but doing so
    would require assuming a filesystem for which that implicit guarantee is
    available. If you're on a different filesystem or a different kernel all
    bets would be off. I don't know how much difference skipping the fsync's
    makes these days if they get done implicitly.

    Note that it's not a sync, but rather, under certain circumstances, we initiate writeback --- but we don't wait for it to complete before
    allowing the close(2) or rename(2) to complete. For close(2), we will initiate a writeback on a close if the file descriptor was opened
    using O_TRUNC and truncate took place to throw away the previous
    contents of the file. For rename(2), if you rename on top of a
    previously existing file, we will initiate the writeback right away.
    This was a tradeoff between safety and performance, and this was done
    because there was an awful lot of buggy applications out there which
didn't use fsync, and the number of application programmers greatly outnumbered the file system programmers. This was a compromise that
    was discussed at a Linux Storage, File Systems, and Memory Management (LSF/MM) conference many years ago, and I think other file systems
    like btrfs and xfs had agreed in principle that this was a good thing
    to do --- but I can't speak to whether they actually implemented it.

    As far as I know, XFS still truncates files with pending writes during
    mount if the file system was not unmounted cleanly. This means that
    renaming for atomic replacement does not work reliably without fsync.
    (But I'm not a file system developer.)

So what dpkg could do is whenever there is a file that dpkg
    would need to overwrite, to write it out to "filename.dpkg-new-$pid"
    and keep a list of all the files. After all of the files are
    written out, call syncfs(2) --- on Linux, syncfs(2) is synchronous,
    although POSIX does not guarantee that the writes will be written
    and stable at the time that syncfs(2) returns. But that should be
    OK, since Debian GNU/kFreeBSD is no longer a thing. Only after
    syncfs(2) returns, do you rename all of the dpkg-new files to the
    final location on disk.

    Does syncfs work for network file systems?

    Maybe a more targeted approach with a first pass of sync_file_range
    with SYNC_FILE_RANGE_WRITE, followed by a second pass with fsync would
    work?

  • From Michael Stone@21:1/5 to Theodore Ts'o on Sun Dec 29 18:00:02 2024
    On Sat, Dec 28, 2024 at 08:22:17PM -0500, Theodore Ts'o wrote:
Note that it's not a sync, but rather, under certain circumstances, we initiate writeback --- but we don't wait for it to complete before
allowing the close(2) or rename(2) to complete. For close(2), we will initiate a writeback on a close if the file descriptor was opened
    using O_TRUNC and truncate took place to throw away the previous
    contents of the file. For rename(2), if you rename on top of a
    previously existing file, we will initiate the writeback right away.
    This was a tradeoff between safety and performance, and this was done
    because there was an awful lot of buggy applications out there which
didn't use fsync, and the number of application programmers greatly outnumbered the file system programmers. This was a compromise that
was discussed at a Linux Storage, File Systems, and Memory Management (LSF/MM) conference many years ago, and I think other file systems
    like btrfs and xfs had agreed in principle that this was a good thing
    to do --- but I can't speak to whether they actually implemented it.

    xfs is actually where I first encountered this issue with dpkg. I think
    it was on an alpha system not long after xfs was released for linux,
    which was not necessarily the most stable combination. The machine
    crashed during a big dpkg run and on reboot the machine had quite a lot
    of empty files where it should have had executables and libraries. I
    think this was somewhat known in that time frame (1999/2000) but it was
    written off as xfs being buggy and I don't recall it getting a lot of attention. (Though I still run into people who insist xfs is prone to
    file corruption based on experiences like this from 25 years ago.) Also,
    xfs wasn't being used for / much, and was mostly found on the kind of
    systems that didn't lose power and with the kind of apps that either
    didn't care about partially written files or were more careful about how
they wrote--so the number of people affected was pretty small. When ext
started exhibiting similar behavior it suddenly became a much bigger

  • From Michael Tokarev@21:1/5 to All on Tue Dec 24 11:00:01 2024
    Hi!

    The no-unsafe-io workaround in dpkg was needed for 2005-era ext2fs
issues, where a power-cut in the middle of a filesystem metadata
operation (which dpkg does a lot) might result in an inconsistent
    filesystem state. This workaround slowed down dpkg operations
    quite significantly (and has been criticised due to that a lot,
    the difference is really significant).

    The workaround is to issue fsync() after almost every filesystem
    operation, instead of after each transaction as dpkg did before.

    Once again: dpkg has always been doing "safe io", the workaround
    was needed for ext2fs only, - it was the filesystem which was
    broken, not dpkg.

    Today, doing an fsync() really hurts, - with SSDs/flash it reduces
    the lifetime of the storage, for many modern filesystems it is a
    costly operation which bloats the metadata tree significantly,
resulting in all further operations becoming inefficient.

    How about turning this option - force-unsafe-io - to on by default
    in 2025? That would be a great present for 2025 New Year! :)

    Thanks,

    /mjt

  • From Nikolaus Rath@21:1/5 to Simon Richter on Mon Dec 30 21:50:01 2024
    Simon Richter <sjr@debian.org> writes:
    The order of operation needs to be

    1. create .dpkg-new file
    2. write data to .dpkg-new file
    3. link existing file to .dpkg-old
    4. rename .dpkg-new file over final file name
    5. clean up .dpkg-old file

    When we reach step 4, the data needs to be written to disk and the metadata in
    the inode referenced by the .dpkg-new file updated, otherwise we atomically replace the existing file with one that is not yet guaranteed to be written out.

    If a system crashed while dpkg was installing a package, then my
    assumption has always been that it's possible that at least this package
    is corrupted.

    You seem to be saying that dpkg needs to make sure that the package is installed correctly even when this happens. Is that right?

    If so, is dpkg also doing something to prevent a partial update across
    multiple files (i.e., some files in the package are upgraded while
    others are not)? If not, then I wonder why having an empty file is worse
    than having one with outdated contents?


    Are these guarantees documented somewhere and I've just never read it?
    Or is everyone else expecting more reliability from dpkg than I do by
    default?


    Best,
    -Nikolaus

• From Julien Plissonneau Duquène@21:1/5 to All on Mon Dec 30 22:30:02 2024
    Hi,

On 2024-12-30 21:38, Nikolaus Rath wrote:

    If a system crashed while dpkg was installing a package, then my
    assumption has always been that it's possible that at least this
    package
    is corrupted.

    The issue here is that without the fsync there is a risk that such
    corruption occurs even if the system crashes _after_ dpkg has finished
    (or finished installing a package).

    What happens in that case is that the metadata (file/link creations,
    renames, unlinks) can be written to the filesystem journal several
    seconds before the data is written to its destination blocks. But for
    security reasons the length of the created file is only updated after
    the data is actually written. This is why instead of getting files with
    random corrupted data you get truncated files if the crash or power loss
    occurs between both writes.

    There is no way to know which are the "not fully written" packages in
    these cases, short of verifying all installed files of all (re)installed/down/upgraded packages of recent runs of dpkg (which could
    be a feature worth having on a recovery bootable image).

    Cheers,

    --
    Julien Plissonneau Duquène

• From Julien Plissonneau Duquène@21:1/5 to All on Tue Dec 24 11:20:01 2024
    Hi,

On 2024-12-24 10:54, Michael Tokarev wrote:

    How about turning this option - force-unsafe-io - to on by default
    in 2025? That would be a great present for 2025 New Year! :)

    That sounds like a sensible idea to me.

    Cheers, and best wishes,

    --
    Julien Plissonneau Duquène

• From Hakan Bayındır@21:1/5 to All on Tue Dec 24 13:10:01 2024
    Hi Michael,

That sounds like a neat idea. Especially with the proliferation of more complex filesystems like BTRFS, the penalty of calling fsync() a lot becomes very visible. I’m not a BTRFS user myself, but I always hear comments and discussions about it.

    Removing this workaround can help to remove the myth that apt is slow.

    Happy new year,

    Cheers,

    Hakan

    On 24 Dec 2024, at 12:54, Michael Tokarev <mjt@tls.msk.ru> wrote:

    Hi!

    The no-unsafe-io workaround in dpkg was needed for 2005-era ext2fs
issues, where a power-cut in the middle of a filesystem metadata
operation (which dpkg does a lot) might result in an inconsistent
    filesystem state. This workaround slowed down dpkg operations
    quite significantly (and has been criticised due to that a lot,
    the difference is really significant).

    The workaround is to issue fsync() after almost every filesystem
    operation, instead of after each transaction as dpkg did before.

    Once again: dpkg has always been doing "safe io", the workaround
    was needed for ext2fs only, - it was the filesystem which was
    broken, not dpkg.

    Today, doing an fsync() really hurts, - with SSDs/flash it reduces
    the lifetime of the storage, for many modern filesystems it is a
    costly operation which bloats the metadata tree significantly,
resulting in all further operations becoming inefficient.

    How about turning this option - force-unsafe-io - to on by default
    in 2025? That would be a great present for 2025 New Year! :)

    Thanks,

    /mjt

  • From Simon Richter@21:1/5 to Michael Tokarev on Tue Dec 24 15:20:01 2024
    Hi,

    On 12/24/24 18:54, Michael Tokarev wrote:

    The no-unsafe-io workaround in dpkg was needed for 2005-era ext2fs
issues, where a power-cut in the middle of a filesystem metadata
operation (which dpkg does a lot) might result in an inconsistent
    filesystem state.

    The thing it protects against is a missing ordering of write() to the
    contents of an inode, and a rename() updating the name referring to it.

    These are unrelated operations even in other file systems, unless you
use data journaling ("data=journal") to force all operations to the
    journal, in order. Normally ("data=ordered") you only get the metadata
    update marking the data valid after the data has been written, but with
    no ordering relative to the file name change.

    The order of operation needs to be

    1. create .dpkg-new file
    2. write data to .dpkg-new file
    3. link existing file to .dpkg-old
    4. rename .dpkg-new file over final file name
    5. clean up .dpkg-old file

    When we reach step 4, the data needs to be written to disk and the
    metadata in the inode referenced by the .dpkg-new file updated,
    otherwise we atomically replace the existing file with one that is not
    yet guaranteed to be written out.

    We get two assurances from the file system here:

    1. the file will not contain garbage data -- the number of bytes marked
    valid will be less than or equal to the number of bytes actually
    written. The number of valid bytes will be zero initially, and only
    after data has been written out, the metadata update to change it to the
    final value is added to the journal.

    2. creation of the inode itself will be written into the journal before
    the rename operation, so the file never vanishes.

    What this does not protect against is the file pointing to a zero-size
    inode. The only way to avoid that is either data journaling, which is
    horribly slow and creates extra writes, or fsync().
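
In code, the per-file sequence above, with the fsync() that closes that
zero-size-inode hole before step 4, looks roughly like this (a sketch
with illustrative names and simplified error handling, not dpkg's
internal implementation):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int replace_file(const char *path, const char *data, size_t len)
{
    char tmp[4096], old[4096];
    snprintf(tmp, sizeof tmp, "%s.dpkg-new", path);
    snprintf(old, sizeof old, "%s.dpkg-old", path);

    /* 1-2. create the .dpkg-new file and write the new contents */
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0 || write(fd, data, len) != (ssize_t)len)
        return -1;

    /* make data and inode metadata durable *before* the rename, so the
     * name never ends up pointing at a zero-size inode after a crash */
    if (fsync(fd) != 0)
        return -1;
    close(fd);

    /* 3. keep the old version around as .dpkg-old */
    (void)link(path, old);   /* may fail if the file is new; ignored here */

    /* 4. atomically switch the name over to the new contents */
    if (rename(tmp, path) != 0)
        return -1;

    /* 5. clean up the backup */
    (void)unlink(old);
    return 0;
}

int main(void)
{
    return replace_file("/tmp/replace-demo.conf", "new contents\n", 13) ? 1 : 0;
}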

    Today, doing an fsync() really hurts, - with SSDs/flash it reduces
    the lifetime of the storage, for many modern filesystems it is a
    costly operation which bloats the metadata tree significantly,
resulting in all further operations becoming inefficient.

    This should not make any difference in the number of write operations necessary, and only affect ordering. The data, metadata journal and
    metadata update still have to be written.

    The only way this could be improved is with a filesystem level
    transaction, where we can ask the file system to perform the entire
    update atomically -- then all the metadata updates can be queued in RAM,
    held back until the data has been synchronized by the kernel in the
    background, and then added to the journal in one go. I would expect that
    with such a file system, fsync() becomes cheap, because it would just be
    added to the transaction, and if the kernel gets around to writing the
    data before the entire transaction is synchronized at the end, it
    becomes a no-op.

    This assumes that maintainer scripts can be included in the transaction (otherwise we need to flush the transaction before invoking a maintainer script), and that no external process records the successful execution
    and expects it to be persisted (apt makes no such assumption, because it
    reads the dpkg status, so this is safe, but e.g. puppet might become
    confused if an operation it marked as successful is rolled back by a
    power loss).

    What could make sense is more aggressively promoting this option for
    containers and similar throwaway installations where there is a
    guarantee that a power loss will have the entire workspace thrown away,
    such as when working in a CI environment.

    However, even that is not guaranteed: if I create a Docker image for
    reuse, Docker will mark the image creation as successful when the
    command returns. Again, there is no ordering guarantee between the
    container contents and the database recording the success of the
    operation outside.

    So no, we cannot drop the fsync(). :\

    Simon

  • From Michael Stone@21:1/5 to Nikolaus Rath on Tue Dec 31 15:10:01 2024
    On Mon, Dec 30, 2024 at 08:38:17PM +0000, Nikolaus Rath wrote:
    If a system crashed while dpkg was installing a package, then my
    assumption has always been that it's possible that at least this package
    is corrupted.

You seem to be saying that dpkg needs to make sure that the package is installed correctly even when this happens. Is that right?

    dpkg tries really hard to make sure the system is *still functional*
    when this happens. If you skip the syncs you may end up with a system
    where (for example) libc6 is effectively gone and the system won't boot
    back up. There may certainly still be issues where an upgrade is in
progress and some of the pieces don't interact properly, but the intent
is that those should result in a system which can be fixed by completing the
    install, vs issues which prevent the system from getting to the point of completing the install. Skip enough syncs and you may not even be able
    to recover via a rescue image without totally reinstalling the system
    because dpkg wouldn't even know the state of the packages.

  • From Soren Stoutner@21:1/5 to All on Tue Dec 31 10:32:09 2024
    On Tuesday, December 31, 2024 10:16:32 AM MST Michael Stone wrote:
    On Tue, Dec 31, 2024 at 05:31:36PM +0100, Sven Mueller wrote:
It feels wrong to me to justify such a heavy performance penalty this way if

    Well, I guess we'd have to agree on the definition of "heavy performance penalty". I have not one debian system where dpkg install time is a bottleneck.

    On my system, which has a Western Digital Black SN850X NVMe (PCIe 4) formatted ext4, dpkg runs really fast (and feels like it runs faster than it did a few years ago on similar hardware). There has been much talk on this list about performance penalties with dpkg’s current configuration, and some requests for
    actual benchmark data showing those performance penalties. So far, nobody has produced any numbers showing that those penalties exist or how significant they
    are. As I don’t experience anything I could describe as a performance problem
    on any of my systems, I think the burden of proof is on those who are experiencing those problems to demonstrate them concretely before we need to spend effort trying to figure out what changes should be made to address them.

    --
    Soren Stoutner
    soren@debian.org

  • From Soren Stoutner@21:1/5 to All on Tue Dec 31 10:54:38 2024
    On Tuesday, December 31, 2024 10:44:58 AM MST Marc Haber wrote:
    On Tue, Dec 31, 2024 at 10:32:09AM -0700, Soren Stoutner wrote:
    On my system, which has a Western Digital Black SN850X NVMe (PCIe 4) formatted ext4, dpkg runs really fast (and feels like it runs faster than it did a few years ago on similar hardware). There has been much talk on this list about performance penalties with dpkg’s current configuration, and some requests for actual benchmark data showing those performance penalties.

Doing fsyncs too often after tiny writes will also cause write
    amplification on the SSD.

    I should use eatmydata more often.

    As pointed out earlier in the thread, the answer to the question of how fsyncs affect SSD lifespan can vary a lot across SSD hardware controllers, because not
    every fsync results in actual writes to the flash storage, and the SSD hardware
controller is often the one that decides whether they do or not.

    In the case of my SSD, total TB written so far over the lifetime of the drive is 4.25 TB.

    This drive is about 5 months old. It has a rated lifetime endurance of 1,200 TBW (Terabytes Written). So, assuming that rating is accurate, it can run under the current load for 117 years.

    --
    Soren Stoutner
    soren@debian.org

  • From Marc Haber@21:1/5 to Soren Stoutner on Tue Dec 31 18:50:01 2024
    On Tue, Dec 31, 2024 at 10:32:09AM -0700, Soren Stoutner wrote:
    On my system, which has a Western Digital Black SN850X NVMe (PCIe 4) formatted
    ext4, dpkg runs really fast (and feels like it runs faster than it did a few years ago on similar hardware). There has been much talk on this list about performance penalties with dpkg’s current configuration, and some requests for
    actual benchmark data showing those performance penalties.

Doing fsyncs too often after tiny writes will also cause write
    amplification on the SSD.

    I should use eatmydata more often.

    Greetings
    Marc

--
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature | How to make an American Quilt  | Fax: *49 6224 1600421

  • From Michael Stone@21:1/5 to Sven Mueller on Tue Dec 31 18:20:02 2024
    On Tue, Dec 31, 2024 at 05:31:36PM +0100, Sven Mueller wrote:
    It feels wrong to me to justify such a heavy performance penalty this way if

    Well, I guess we'd have to agree on the definition of "heavy performance penalty". I have not one debian system where dpkg install time is a
    bottleneck.

  • From Michael Stone@21:1/5 to Marc Haber on Tue Dec 31 21:50:01 2024
    On Tue, Dec 31, 2024 at 06:44:58PM +0100, Marc Haber wrote:
    On Tue, Dec 31, 2024 at 10:32:09AM -0700, Soren Stoutner wrote:
    On my system, which has a Western Digital Black SN850X NVMe (PCIe 4) formatted
ext4, dpkg runs really fast (and feels like it runs faster than it did a few years ago on similar hardware). There has been much talk on this list about performance penalties with dpkg’s current configuration, and some requests for
    actual benchmark data showing those performance penalties.

Doing fsyncs too often after tiny writes will also cause write
    amplification on the SSD.

    The two year old NVMe drive in my primary desktop (which follows sid and
    is updated at least once per day--far more dpkg activity than any normal system) reports 21TB written/3% of the drive's expected endurance. There
    is no possibility that I will hit that limit before the drive becomes completely obsolete e-waste.

    For this to be an actual problem rather than a (questionable)
    theoretical issue would require someone to be doing continuous dpkg
    upgrades to a low-write-endurance SD card...which AFAIK isn't a thing
    actual people do. dpkg simply isn't the kind of tool which will cause
    issues on an ssd in any reasonable scenario. If this is really a concern
    for you, look for tools doing constant syncs (a good example is older
    browsers which constantly saved small changes to configuration databases
    which could amount to 10s of GB per day); don't look at a tool which in
    typical operation doesn't write more than a few megabytes per day.

  • From Michael Tokarev@21:1/5 to Simon Richter on Thu Dec 26 10:10:01 2024
    24.12.2024 17:10, Simon Richter wrote:
    Hi,

    On 12/24/24 18:54, Michael Tokarev wrote:

    The no-unsafe-io workaround in dpkg was needed for 2005-era ext2fs
issues, where a power-cut in the middle of a filesystem metadata
operation (which dpkg does a lot) might result in an inconsistent
    filesystem state.

    The thing it protects against is a missing ordering of write() to the contents of an inode, and a rename() updating the name referring to it.

    These are unrelated operations even in other file systems, unless you use data journaling ("data=journaled") to force all operations to the journal,
    in order. Normally ("data=ordered") you only get the metadata update marking the data valid after the data has been written, but with no ordering
    relative to the file name change.

    The order of operation needs to be

    1. create .dpkg-new file
    2. write data to .dpkg-new file
    3. link existing file to .dpkg-old
    4. rename .dpkg-new file over final file name
    5. clean up .dpkg-old file

    When we reach step 4, the data needs to be written to disk and the metadata in the inode referenced by the .dpkg-new file updated, otherwise we
    atomically replace the existing file with one that is not yet guaranteed to be written out.

This brings up a question: how did dpkg work before ext2fs started showing this zero-length file behavior? IIRC it was rather safe, no?

    What you're describing seems reasonable. But I wonder if we can do better here.

How about doing steps 1..3 for *all* files in the package, and only
after that doing a single fsync() and the remaining steps 4..5, again
for all files?

    Thanks,

    /mjt

• From Julien Plissonneau Duquène@21:1/5 to All on Thu Dec 26 10:40:01 2024
    Hi,

On 2024-12-24 15:10, Simon Richter wrote:

    This should not make any difference in the number of write operations necessary, and only affect ordering. The data, metadata journal and
    metadata update still have to be written.

I would expect that some reordering makes it possible for fewer actual
physical write operations to happen, i.e. writes to same/neighbouring
blocks get merged/grouped (possibly by the hardware if not the kernel),
which would make a difference both for spinning devices' performance (fewer
seeks) and for solid state devices' longevity (as these have larger physical blocks), but I don't know if that's actually how it works in that case.

One way to know would be to benchmark what actually happens nowadays with
and without --force-unsafe-io to get some actual numbers to weigh in on the decision to make the change or not.

It would be surprising though for the dpkg man pages (among other
places) to talk about performance degradation if it were not real.

    The only way this could be improved is with a filesystem level
    transaction, where we can ask the file system to perform the entire
    update atomically -- then all the metadata updates can be queued in
    RAM, held back until the data has been synchronized by the kernel in
    the background, and then added to the journal in one go. I would expect
    that with such a file system, fsync() becomes cheap, because it would
    just be added to the transaction, and if the kernel gets around to
    writing the data before the entire transaction is synchronized at the
    end, it becomes a no-op.

    That sounds interesting. But — do we have filesystems on Linux that can
    do that already, or is this still a wishlist item? Also worth noting, at
    least one well-known implementation in another OS was deprecated [1]
    citing complexity and lack of popularity as the reasons for that
    decision, and the feature is missing in their next-gen FS. So maybe it's
    not that great after all?

    Anyway in the current toolbox besides --force-unsafe-io we also have:
    - volume or FS snapshots, for similar or better safety but not the
    automatic performance gains; probably not (yet?) available on most
    systems
    - the auto_da_alloc ext4 mount option that AIUI should do The Right
    Thing in dpkg's use case even without the fsync, actual reliability and performance impact unknown; appears to be set by default on trixie
    - eatmydata
- io_uring, which allows asynchronous file operations; implementing it
would require significant changes in dpkg; potential performance gains in
dpkg's use case are not yet evaluated AFAIK but it looks like the right solution for that use case.

    BTW for those interested in reading a bit more about the historical and
    current context around this issue aka O_PONIES I'm adding a few links at
    [2].

    but e.g. puppet might become confused

    Heh. Ansible wins again.

    So no, we cannot drop the fsync(). :\

    Nowadays, most machines are unlikely to be subject to power failures at
    the worst time: laptops or other mobile devices that have batteries have replaced desktop PCs in many workplaces and homes, and machines in
    datacenters usually have redundant power supplies and
    batteries+generators backups. And the default filesystem for new
    installations, ext4, is mounted with auto_da_alloc by default which
    should make this drop safe, but whether that will result in significant performance gains is IMO something to be tested.

    If the measured performance gain makes it interesting to drop the fsync,
    maybe this could become a configuration item that is set automatically
    in most cases by detecting the machine type (battery, dual PSU,
    container, VM => drop fsync) and filesystem (safe fs and mount options
    drop fsync) or by asking the user in other cases or in expert install
    mode, defaulting to the safer --no-force-unsafe-io.

    Cheers,


    [1]:
    https://learn.microsoft.com/en-us/windows/win32/fileio/deprecation-of-txf
    [2]: https://lwn.net/Articles/351422/
    https://lwn.net/Articles/322823/
    https://lwn.net/Articles/1001770/

    --
    Julien Plissonneau Duquène

  • From Nikolaus Rath@21:1/5 to sre4ever@free.fr on Wed Jan 1 16:20:01 2025
    Julien Plissonneau Duquène <sre4ever@free.fr> writes:

    Hi,

On 2024-12-30 21:38, Nikolaus Rath wrote:
    If a system crashed while dpkg was installing a package, then my
    assumption has always been that it's possible that at least this package
    is corrupted.

    The issue here is that without the fsync there is a risk that such corruption occurs even if the system crashes _after_ dpkg has finished (or finished installing a package).

    That is not my understanding of the issue. The proposal was to disable
    fsync after individual files have been unpacked, i.e. multiple times per package. Not about one final fsync just before dpkg exits.

    Best,
    -Nikolaus

  • From Michael Stone@21:1/5 to Nikolaus Rath on Wed Jan 1 18:10:01 2025
    On Wed, Jan 01, 2025 at 03:15:25PM +0000, Nikolaus Rath wrote:
Julien Plissonneau Duquène <sre4ever@free.fr> writes:
On 2024-12-30 21:38, Nikolaus Rath wrote:
    If a system crashed while dpkg was installing a package, then my
assumption has always been that it's possible that at least this package is corrupted.

    The issue here is that without the fsync there is a risk that such corruption
    occurs even if the system crashes _after_ dpkg has finished (or finished
    installing a package).

    That is not my understanding of the issue. The proposal was to disable
fsync after individual files have been unpacked, i.e. multiple times per package. Not about one final fsync just before dpkg exits.

    You seem to be assuming that dpkg is only processing a single package at
    a time? Doing so would not be an efficiency gain, IMO.

  • From Soren Stoutner@21:1/5 to All on Wed Jan 1 11:33:34 2025
    On Wednesday, January 1, 2025 11:27:38 AM MST Aurélien COUDERC wrote:
On Tuesday, 31 December 2024, 18:32:09 UTC+1, Soren Stoutner wrote:
    On Tuesday, December 31, 2024 10:16:32 AM MST Michael Stone wrote:
    On Tue, Dec 31, 2024 at 05:31:36PM +0100, Sven Mueller wrote:
It feels wrong to me to justify such a heavy performance penalty this way if

    Well, I guess we'd have to agree on the definition of "heavy performance penalty". I have not one debian system where dpkg install time is a bottleneck.

    So far, nobody has
    produced any numbers showing that those penalties exist or how significant they are. As I don’t experience anything I could describe as a
    performance
    problem on any of my systems, I think the burden of proof is on those who are experiencing those problems to demonstrate them concretely before we need to spend effort trying to figure out what changes should be made to address them.
    Here’s a quick « benchmark » in a sid Plasma desktop qemu VM where I had a
    snapshot of up-to-date sid from Nov 24th, upgrading to today’s sid :

    Summary:
    Upgrading: 658, Installing: 304, Removing: 58, Not Upgrading: 2
    Download size: 0 B / 1 032 MB
    Space needed: 573 MB / 9 051 MB available

    # time apt -y full-upgrade

    real 9m49,143s
    user 2m16,581s
    sys 1m17,361s

    # time eatmydata apt -y full-upgrade

    real 3m25,268s
    user 2m26,820s
    sys 1m16,784s

    That’s close to a *3 times faster* wall clock time when run with eatmydata.

    The measurements are done after first running apt --download-only and taking the VM snapshot to avoid network impact. The VM installation is running
    plain
    ext4 with 4 vCPU / 4 GiB RAM.
    The host was otherwise idle. It runs sid on btrfs with default mount options on top of LUKS with the discard flag set. The VM’s qcow2 file is flagged with
    the C / No_COW xattr. It’s a recent Ryzen system with plenty of free RAM / disk space.

    While I don’t have a setup to quickly reproduce an upgrade on the bare metal
    host in my experience I see comparable impacts. And I’ve experienced similar
    behaviours on other machines.


    I won’t pretend I know what I’m doing, so I’m probably doing it wrong and my
    installs are probably broken in some obvious way. You were asking for data
    so
    here you go with a shiny data point. :)

That is an interesting data point. Could you also run with --force-unsafe-io instead of eatmydata? I don’t know if there would be much of a difference (hence the need for a good benchmark), but as the proposal here is to enable --force-unsafe-io by default rather than eatmydata, it would be interesting to see what the results of that option would be.

    --
    Soren Stoutner
    soren@debian.org

• From Aurélien COUDERC@21:1/5 to All on Wed Jan 1 19:30:02 2025
On Tuesday, 31 December 2024, 18:32:09 UTC+1, Soren Stoutner wrote:
    On Tuesday, December 31, 2024 10:16:32 AM MST Michael Stone wrote:
    On Tue, Dec 31, 2024 at 05:31:36PM +0100, Sven Mueller wrote:
It feels wrong to me to justify such a heavy performance penalty this way if

    Well, I guess we'd have to agree on the definition of "heavy performance penalty". I have not one debian system where dpkg install time is a bottleneck.

    So far, nobody has
    produced any numbers showing that those penalties exist or how significant they
    are. As I don’t experience anything I could describe as a performance problem
    on any of my systems, I think the burden of proof is on those who are experiencing those problems to demonstrate them concretely before we need to spend effort trying to figure out what changes should be made to address them.

    Here’s a quick « benchmark » in a sid Plasma desktop qemu VM where I had a snapshot of up-to-date sid from Nov 24th, upgrading to today’s sid :

    Summary:
    Upgrading: 658, Installing: 304, Removing: 58, Not Upgrading: 2
    Download size: 0 B / 1 032 MB
    Space needed: 573 MB / 9 051 MB available

    # time apt -y full-upgrade

    real 9m49,143s
    user 2m16,581s
    sys 1m17,361s

    # time eatmydata apt -y full-upgrade

    real 3m25,268s
    user 2m26,820s
    sys 1m16,784s

    That’s close to a *3 times faster* wall clock time when run with eatmydata.

    The measurements are done after first running apt --download-only and taking the VM snapshot to avoid network impact.
    The VM installation is running plain ext4 with 4 vCPU / 4 GiB RAM.
    The host was otherwise idle. It runs sid on btrfs with default mount options on top of LUKS with the discard flag set. The VM’s qcow2 file is flagged with the C / No_COW xattr. It’s a recent Ryzen system with plenty of free RAM / disk space.

    While I don’t have a setup to quickly reproduce an upgrade on the bare metal host in my experience I see comparable impacts. And I’ve experienced similar behaviours on other machines.


    I won’t pretend I know what I’m doing, so I’m probably doing it wrong and my installs are probably broken in some obvious way. You were asking for data so here you go with a shiny data point. :)


    Happy new year,
    --
    Aurélien

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ahmad Khalifa@21:1/5 to All on Wed Jan 1 20:20:01 2025
    On 01/01/2025 18:27, Aurélien COUDERC wrote:
    Here’s a quick « benchmark » in a sid Plasma desktop qemu VM where I had a snapshot of up-to-date sid from Nov 24th, upgrading to today’s sid :

    Summary:
    Upgrading: 658, Installing: 304, Removing: 58, Not Upgrading: 2
    Download size: 0 B / 1 032 MB
    Space needed: 573 MB / 9 051 MB available

    # time apt -y full-upgrade

    real 9m49,143s
    user 2m16,581s
    sys 1m17,361s

    # time eatmydata apt -y full-upgrade

    real 3m25,268s
    user 2m26,820s
    sys 1m16,784s

    That’s close to a *3 times faster* wall clock time when run with eatmydata.

    With the second (eatmydata) upgrade nothing has been written back to disk at all. For up to 30
    seconds afterwards, if your machine loses power and, for example, grub.{cfg,efi} or
    shim.efi was updated, you can't boot at all. Not to mention kernel
    panics still happen.

    Surely, we can all agree dpkg should at least do a single 'sync' at the
    end? Perhaps this would be a more realistic comparison?
    # time (eatmydata apt -y full-upgrade; sync)

    I wrote something to compare a single `dpkg --install`. This is on a
    stable chroot where I downloaded 2 packages. Installing and purging with/without eatmydata, but with an explicit sync.
    `dpkg --install something; sync`
    vs.
    `eatmydata dpkg --install something; sync`

    The two .debs were
    -rw-r--r-- 1 root root 138K Jan  1 18:29 fdisk_2.38.1-5+deb12u2_amd64.deb
    -rw-r--r-- 1 root root  12M Jan  1 18:29 firmware-amd-graphics_20230210-5_all.deb


    Running on an OKish SSD:
    Warmup... Timing dpkg with fdisk
    .05 .04 .04
    Warmup... Timing eatmydata dpkg with fdisk
    .03 .02 .02
    Warmup... Timing dpkg with firmware-amd-graphics
    .46 .47 .45
    Warmup... Timing eatmydata dpkg with firmware-amd-graphics
    .63 .63 .67

    Running on the slowest mechanical HDD I have:
    Warmup... Timing dpkg with fdisk
    .09 .08 .11
    Warmup... Timing eatmydata dpkg with fdisk
    .05 .05 .05
    Warmup... Timing dpkg with firmware-amd-graphics
    3.46 3.58 3.46
    Warmup... Timing eatmydata dpkg with firmware-amd-graphics
    2.95 2.99 3.06

    Not sure why eatmydata takes longer on the SSD. On the mechanical drive eatmydata is consistently faster, but still only a 15% improvement.

    Script snippet here:
    https://salsa.debian.org/-/snippets/765

    --
    Regards,
    Ahmad

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bálint Réczey@21:1/5 to All on Wed Jan 1 21:40:01 2025
    Hi,

    Marc Haber <mh+debian-devel@zugschlus.de> wrote (on Tue, Dec 31, 2024, 18:44):

    On Tue, Dec 31, 2024 at 10:32:09AM -0700, Soren Stoutner wrote:
    On my system, which has a Western Digital Black SN850X NVMe (PCIe 4) formatted
    ext4, dpkg runs really fast (and feels like it runs faster than it did a few
    years ago on similar hardware). There has been much talk on this list about
    performance penalties with dpkg’s current configuration, and some requests for
    actual benchmark data showing those performance penalties.

    Doing fsyncs too often after tiny writes will also cause write
    amplification on the SSD.

    I should use eatmydata more often.

    I also use eatmydata from time to time where it is safe, but sometimes I
    forget, which is why I packaged the snippet to make all apt runs use
    eatmydata automatically: https://salsa.debian.org/debian/apt-eatmydata/-/blob/master/debian/control?ref_type=heads

    I'll upload it when apt also gets a necessary fix to make removing the
    snippet safe:
    https://salsa.debian.org/apt-team/apt/-/merge_requests/419

    There is an equivalent simple solution for GitHub Actions as well: https://github.com/marketplace/actions/apt-eatmydata

    I'll write a short blog post about those when apt-eatmydata gets
    accepted to the archive.

    Happy New Year!

    Cheers,
    Balint

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Aurélien COUDERC@21:1/5 to All on Thu Jan 2 01:20:01 2025
    On Wednesday, January 1, 2025, 19:33:34 UTC+1, Soren Stoutner wrote:

    That is an interesting data point. Could you also run with --force-unsafe-io
    instead of eatmydata? I don’t know if there would be much of a difference (hence the reason for the need of a good benchmark), but as the proposal here
    is to enable --force-unsafe-io by default instead of eatmydata it would be interesting to see what the results of that option would be.

    Sure but I wouldn’t know how to do that since I’m calling apt and force-unsafe-io seems to be a dpkg option ?


    Thanks,
    --
    Aurélien

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Soren Stoutner@21:1/5 to All on Wed Jan 1 17:27:13 2025
    On Wednesday, January 1, 2025 5:00:10 PM MST Aurélien COUDERC wrote:
    On Wednesday, January 1, 2025, 19:33:34 UTC+1, Soren Stoutner wrote:
    That is an interesting data point. Could you also run with --force-unsafe-io
    instead of eatmydata? I don’t know if there would be much of a difference
    (hence the reason for the need of a good benchmark), but as the proposal here
    is to enable --force-unsafe-io by default instead of eatmydata it would be interesting to see what the results of that option would be.

    Sure but I wouldn’t know how to do that since I’m calling apt and force-unsafe-io seems to be a dpkg option ?

    Can’t you just take the list of packages you have already downloaded with apt
    and install them with dpkg instead?

    The speed differential you have demonstrated with eatmydata is significant. I don’t know if --force-unsafe-io will produce the same speed differential or not, but if it does then I think you have met the criteria for it being worth our while to see if we can safely adopt at least some aspects of --force-unsafe-io, at least on some file systems.

    --
    Soren Stoutner
    soren@debian.org

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Julien Plissonneau Duquène@21:1/5 to All on Thu Jan 2 08:50:01 2025
    Hi,

    On 2025-01-02 01:00, Aurélien COUDERC wrote:

    That is an interesting data point. Could you also run with
    --force-unsafe-io

    Sure but I wouldn’t know how to do that since I’m calling apt and force-unsafe-io seems to be a dpkg option ?

    You could try adding -o dpkg::options=--force-unsafe-io to the command
    line.
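
    For instance (untested here, and assuming apt's usual list-append syntax for Dpkg::Options):

    # apt -o Dpkg::Options::="--force-unsafe-io" -y full-upgrade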

    Cheers,

    --
    Julien Plissonneau Duquène

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Julien Plissonneau Duquène@21:1/5 to All on Thu Jan 2 09:00:02 2025
    Hi,

    On 2025-01-01 16:15, Nikolaus Rath wrote:

    That is not my understanding of the issue. The proposal was to disable
    fsync after individual files have been unpacked, i.e. multiple times
    per
    package. Not about one final fsync just before dpkg exits.

    The way fsync works, that would still be multiple times per package
    anyway, as fsync has to be called for every file to be written. The way
    dpkg works, there is no such thing as "one final fsync before exit":
    dpkg processes packages sequentially and commits the writes after
    processing each package [1]. There are however already some existing optimizations that have been reported in this thread [2], notably this
    one:

    * Then we reworked the code to defer and batch all the fsync()s for
    a specific package after all the file writes, and before the
    renames,
    which was a bit better but not great.

    The code in question is there [3] btw, if anyone wants to take a look.
    After reading that and current ext4 features and default mount options
    it seems now likely to me that not much (if any) performance
    improvements or write amplification reductions are to be expected from --force-unsafe-io alone. I'm now waiting for our very welcome volunteer
    to come back with numbers that will hopefully end that cliffhanger.

    Cheers,


    [1] and there is potential for optimizations there, but getting them (1)
    to just work and then (2) to be at least as safe as the current code is
    not exactly going to be trivial.
    [2]: https://lists.debian.org/debian-devel/2024/12/msg00597.html
    [3]:
    https://sources.debian.org/src/dpkg/1.22.11/src/main/archives.c/#L1159

    --
    Julien Plissonneau Duquène

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Julien Plissonneau Duquène@21:1/5 to All on Thu Dec 26 12:40:01 2024
    On 2024-12-26 11:59, Hakan Bayındır wrote:

    So making any assumptions like we did with spinning drives is mostly
    moot at this point, and the industry is very opaque about that layer.

    That's one of the reasons why I think benchmarking would help here. I
    would expect fewer but larger write operations to help with the wear
    issue, though: most FTLs, especially the ones on cheaper media, are
    probably not too smart and may end up erasing blocks more frequently
    than is actually necessary with many scattered small writes.

    Let's not forget that any server running with a RAID controller will
    already have a battery backed or non-volatile cache on the card, plus
    new SSDs (esp. higher-end consumer drives, i.e. Samsung 6xx, 8xx, 9xx and
    similar, and enterprise drives) have unexpected power loss mitigations
    in hardware, be it supercapacitors or non-volatile caches.

    Sure, but the issue at stake here is that in some cases the expected
    data hasn't even been sent to the hardware when the power loss (or
    system crash) occurs. So while the features above help improve
    performance in general (which in turn may contribute to reducing the
    window of time in which the system is vulnerable to a power loss) they
    do not resolve the issue.

    Cheers,

    --
    Julien Plissonneau Duquène

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Simon Richter@21:1/5 to All on Thu Dec 26 13:30:02 2024
    Hi,

    On 12/26/24 18:33, Julien Plissonneau Duquène wrote:

    This should not make any difference in the number of write operations
    necessary, and only affect ordering. The data, metadata journal and
    metadata update still have to be written.

    I would expect that some reordering makes it possible for fewer actual physical write operations to happen, i.e. writes to same/neighbouring
    blocks get merged/grouped (possibly by the hardware if not by the kernel), which would make a difference both for spinning devices' performance (fewer seeks) and for solid-state devices' longevity (as these have larger physical blocks), but I don't know if that's actually how it works in that case.

    On SSDs, it does not matter, both because modern media lasts longer than
    the rest of the computer now, and because the load balancer will largely
    ignore the logical block addresses when deciding where to put data into
    the physical medium anyway.

    On harddisks, it absolutely makes a noticeable difference, but so does journaling.

    It would be surprising, though, that the dpkg man pages (among other
    places) talk about performance degradations if these were not real.

    ext4's delayed allocations mainly mean that the window where the inode
    is zero sized is larger (can be a few seconds after dpkg exits with --force-unsafe-io), so the problem is more observable, while on other
    file systems, you more often get lucky and your files are filled with
    the desired data instead of garbage.

    The delayed allocations, on the other hand, allow the file system to
    merge the entire allocation for the file, instead of gradually extending
    it (but that can be easily fixed by using fallocate(2) ).
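
    As an illustration only, a minimal sketch of that remark (hypothetical code, not dpkg's; using the portable posix_fallocate() wrapper rather than the Linux-specific fallocate(2) call mentioned above):

        /* Hypothetical sketch, not dpkg code: reserve the file's final size
         * before streaming the data, so the filesystem can allocate the whole
         * extent in one go even with delayed allocation. */
        #include <fcntl.h>

        static int reserve_extent(int fd, off_t final_size)
        {
            /* Returns 0 on success, or an errno value on failure. */
            return posix_fallocate(fd, 0, final_size);
        }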

    [filesystem level transactions]

    That sounds interesting. But — do we have filesystems on Linux that can
    do that already, or is this still a wishlist item? Also worth noting, at least one well-known implementation in another OS was deprecated [1]
    citing complexity and lack of popularity as the reasons for that
    decision, and the feature is missing in their next-gen FS. So maybe it's
    not that great after all?

    It is complex to the extent that it requires the entire file system to
    be designed around it, including the file system API -- suddenly you get
    things like isolation levels and transaction conflicts that programs
    need to be at least vaguely aware of.

    It would be easier to do in Linux than in Windows, certainly, because on Windows, file contents bypass the file system drivers entirely, and
    there are additional APIs like transfer offload that would interact
    badly with a transactional interface, and that would be sorely missed by
    people using a SAN as storage backend.

    Anyway in the current toolbox besides --force-unsafe-io we also have:
    - volume or FS snapshots, for similar or better safety but not the
    automatic performance gains; probably not (yet?) available on most systems

    Snapshots only work if there is a way to merge them back afterwards.

    What the systemd people are doing with immutable images basically goes
    in the direction of snapshots -- you'd unpack the files using "unsafe"
    I/O, then finally create an image, fsync() that, and then update the OS metadata which image to load at boot.

    - the auto_da_alloc ext4 mount option that AIUI should do The Right
    Thing in dpkg's use case even without the fsync, actual reliability and performance impact unknown; appears to be set by default on trixie

    Yes, that inserts the missing fsync(). :>

    I'd expect it to perform a little bit better than the explicit fsync()
    though, because that does not impose an order of operation between
    files. The downside is that it also does not force an order between the
    file system updates and the rewrite of the dpkg status file.

    What I could see working in dpkg would be delaying the fsync() call
    until right before the rename(), which is in a separate "cleanup" round
    of operations anyway for the cases that matter. The difficulty there is
    that we'd have to keep the file descriptor open until then, which would
    need careful management or a horrible hack so we don't run into the user
    or system-wide limit for open file descriptors, and recover if we do.
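
    A rough sketch of that idea (hypothetical code and names, not dpkg's actual implementation), with the descriptors kept open from the unpack phase and flushed only in the cleanup round:

        /* Deferred-fsync sketch, hypothetical code: each unpacked file keeps
         * its descriptor open; the cleanup round flushes it and renames the
         * file into place. A real implementation would have to bound the
         * number of pending descriptors against RLIMIT_NOFILE and recover
         * when the limit is hit. */
        #include <stdio.h>
        #include <unistd.h>

        struct pending {
            int fd;             /* still open from the unpack phase */
            char tmp[4096];     /* e.g. "foo.dpkg-new" */
            char dst[4096];     /* e.g. "foo" */
        };

        static int commit_pending(struct pending *p, size_t n)
        {
            for (size_t i = 0; i < n; i++) {
                int rc = fsync(p[i].fd);   /* deferred until just before rename */
                close(p[i].fd);
                if (rc < 0 || rename(p[i].tmp, p[i].dst) < 0)
                    return -1;             /* real code would report errno */
            }
            return 0;
        }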

    - eatmydata

    That just neuters fsync().

    - io_uring that allows asynchronous file operations; implementation
    would require important changes in dpkg; potential performance gains in dpkg's use case are not yet evaluated AFAIK but it looks like the right solution for that use case.

    That would be Linux specific, though.

    Nowadays, most machines are unlikely to be subject to power failures at
    the worst time:

    Yes, but we have more people running nVidia's kernel drivers now, so it
    all evens out.

    The decision when it is safe to skip fsync() is mostly dependent on
    factors that are not visible to the dpkg process, like "will the result
    of this operation be packed together into an image afterwards?", so I
    doubt there is a good heuristic.

    My feeling is that this is becoming less and less relevant though,
    because it does not matter with SSDs.

    Simon

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Julien Plissonneau Duquène@21:1/5 to All on Thu Dec 26 16:30:02 2024
    On 2024-12-26 13:23, Simon Richter wrote:

    On SSDs, it does not matter, both because modern media lasts longer
    than the rest of the computer now, and because the load balancer will
    largely ignore the logical block addresses when deciding where to put
    data into the physical medium anyway.

    I'm not so sure about that, especially on cheaper and smaller storage.
    There are still recent reports of people being able to wear e.g. SD
    cards to the point of failure in weeks or months though that's certainly
    not with system updates alone. This matters more on embedded devices
    where the storage is not always (easily) replaceable, and some of these
    devices may have fairly long lifespans.

    [transactional FS]
    It would be easier to do in Linux than in Windows

    ... but it sounds very much like we are not anywhere near there yet,
    while others had it working and are now running away from it.

    Snapshots only work if there is a way to merge them back afterwards.

    What the systemd people are doing with immutable images basically goes
    in the direction of snapshots -- you'd unpack the files using "unsafe"
    I/O, then finally create an image, fsync() that, and then update the OS metadata which image to load at boot.

    For integration with dpkg I think the reverse approach would work
    better: the snapshot would only be used to perform a rollback while
    rebooting after a system crash. In the nominal case it would just be
    deleted automatically at the end of the update procedure, after
    confirming that everything is actually written on the medium.

    Anyway currently that option is unavailable on most installed systems.

    - io_uring

    That would be Linux specific, though.

    Not an issue IMO. On systems that can't have it or another similar API
    dpkg could just fall back to using the good old synchronous API, with
    the same performance we have today.

    My feeling is that this is becoming less and less relevant though,
    because it does not matter with SSDs.

    A volunteer is still needed to bench a few runs of large system updates
    on ext4/SSD with and without --force-unsafe-io to sort that out ;)

    Cheers,

    --
    Julien Plissonneau Duquène

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Simon Josefsson@21:1/5 to All on Thu Dec 26 17:50:01 2024
    Did anyone benchmark if this makes any real difference, on a set of
    machines and file systems?

    Say typical x86 laptop+server, arm64 SoC+server, GitLab/GitHub shared
    runners, across ext4, xfs, btrfs, across modern SSD, old SSD/flash and
    spinning rust.

    If eatmydata still results in a performance boost or reliability
    improvement (due to reduced wear and tear) on any of those platforms,
    maybe we can recommend that instead.

    /Simon


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris Hofstaedtler@21:1/5 to All on Thu Dec 26 17:30:01 2024
    * Simon Richter <sjr@debian.org> [241226 13:24]:
    My feeling is that this is becoming less and less relevant though, because
    it does not matter with SSDs.

    This might be true on SSDs backing a single system, but on
    (otherwise well-dimensioned) SANs the I/O-spikes are still very much
    visible. Same is true for various container workloads.

    (And yeah, there are strategies to improve these scenarios, but it's
    not "it just works" territory.)

    Chris

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael Stone@21:1/5 to Simon Richter on Thu Dec 26 19:30:01 2024
    On Thu, Dec 26, 2024 at 09:23:36PM +0900, Simon Richter wrote:
    My feeling is that this is becoming less and less relevant though,
    because it does not matter with SSDs.

    To summarize: this thread was started with a mistaken belief that the
    current behavior is only important on ext2. In reality the "excessive"
    fsync's are the correct behavior to guarantee atomic replacement of
    files. You can skip all that and in 99.9% of cases you'll be fine, but
    I've seen what happens in the last .1% if the machine dies at just the
    wrong time--and it isn't pretty. In certain situations, with certain filesystems, you can rely on implicit behaviors which will mitigate
    issues, but that will fail in other situations on other filesystems
    without the same implicit guarantees. The right way to get better
    performance is to get a reliable SSD or NVRAM cache. If you have a slow
    disk there are options you can use which will make things noticeably
    faster, and you will get to keep all the pieces if the system blows up
    while you're using those options. Each person should make their own
    decision about whether they want that, and the out-of-box default should
    be the most reliable option.

    Further reading: look at the auto_da_alloc option in ext4. Note that it
    says that doing the rename without the sync is wrong, but there's now a heuristic in ext4 that tries to insert an implicit sync when that
    anti-pattern is used (because so much data got eaten when people did the
    wrong thing). By leaning on that band-aid dpkg might get away with
    skipping the sync, but doing so would require assuming a filesystem for
    which that implicit guarantee is available. If you're on a different
    filesystem or a different kernel all bets would be off. I don't know
    how much difference skipping the fsync's makes these days if they
    get done implicitly.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Julian Andres Klode@21:1/5 to Simon Richter on Fri Dec 27 09:20:01 2024
    On Tue, Dec 24, 2024 at 11:10:27PM +0900, Simon Richter wrote:
    Hi,

    On 12/24/24 18:54, Michael Tokarev wrote:

    The no-unsafe-io workaround in dpkg was needed for 2005-era ext2fs
    issues, where a power-cut in the middle of a filesystem metadata
    operation (which dpkg does a lot) might result in an inconsistent filesystem state.

    The thing it protects against is a missing ordering of write() to the contents of an inode, and a rename() updating the name referring to it.

    These are unrelated operations even in other file systems, unless you use data journaling ("data=journal") to force all operations to the journal,
    in order. Normally ("data=ordered") you only get the metadata update marking the data valid after the data has been written, but with no ordering
    relative to the file name change.

    The order of operation needs to be

    1. create .dpkg-new file
    2. write data to .dpkg-new file
    3. link existing file to .dpkg-old
    4. rename .dpkg-new file over final file name
    5. clean up .dpkg-old file

    When we reach step 4, the data needs to be written to disk and the metadata in the inode referenced by the .dpkg-new file updated, otherwise we atomically replace the existing file with one that is not yet guaranteed to be written out.

    We get two assurances from the file system here:

    1. the file will not contain garbage data -- the number of bytes marked
    valid will be less than or equal to the number of bytes actually written.
    The number of valid bytes will be zero initially, and only after data has been written out, the metadata update to change it to the final value is added to the journal.

    2. creation of the inode itself will be written into the journal before the rename operation, so the file never vanishes.

    What this does not protect against is the file pointing to a zero-size
    inode. The only way to avoid that is either data journaling, which is horribly slow and creates extra writes, or fsync().
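
    To make the pattern concrete, a minimal sketch (hypothetical code, not taken from dpkg) of the write/fsync/rename sequence being defended here:

        /* Safe-replace sketch, hypothetical code: write the new content to the
         * temporary name, force it to disk, and only then rename it over the
         * old file, so a crash can never leave the final name pointing at a
         * zero-length inode. */
        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        static int safe_replace(const char *dst, const char *tmp,
                                const void *data, size_t len)
        {
            int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
            if (fd < 0)
                return -1;
            if (write(fd, data, len) != (ssize_t)len || fsync(fd) < 0) {
                close(fd);
                unlink(tmp);
                return -1;
            }
            close(fd);
            return rename(tmp, dst);   /* data is already durable at this point */
        }

    Dropping either the fsync() or the ordering is exactly what opens the zero-length-file window described above.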

    Today, doing an fsync() really hurts: with SSDs/flash it reduces
    the lifetime of the storage, and for many modern filesystems it is a
    costly operation which bloats the metadata tree significantly,
    resulting in all further operations becoming inefficient.

    This should not make any difference in the number of write operations necessary, and only affect ordering. The data, metadata journal and metadata update still have to be written.

    The only way this could be improved is with a filesystem level transaction, where we can ask the file system to perform the entire update atomically -- then all the metadata updates can be queued in RAM, held back until the data has been synchronized by the kernel in the background, and then added to the journal in one go. I would expect that with such a file system, fsync() becomes cheap, because it would just be added to the transaction, and if the kernel gets around to writing the data before the entire transaction is synchronized at the end, it becomes a no-op.

    This assumes that maintainer scripts can be included in the transaction (otherwise we need to flush the transaction before invoking a maintainer script), and that no external process records the successful execution and expects it to be persisted (apt makes no such assumption, because it reads the dpkg status, so this is safe, but e.g. puppet might become confused if
    an operation it marked as successful is rolled back by a power loss).

    What could make sense is more aggressively promoting this option for containers and similar throwaway installations where there is a guarantee that a power loss will have the entire workspace thrown away, such as when working in a CI environment.

    However, even that is not guaranteed: if I create a Docker image for reuse, Docker will mark the image creation as successful when the command returns. Again, there is no ordering guarantee between the container contents and the database recording the success of the operation outside.

    So no, we cannot drop the fsync(). :\

    I do have a plan, namely to merge the btrfs snapshot integration into apt:
    if we took a snapshot, we run dpkg with --force-unsafe-io.

    The cool solution would be to take the snapshot, run dpkg inside it,
    and then switch to it, but one step after the other; that's still very
    much WIP.

    --
    debian developer - deb.li/jak | jak-linux.org - free software dev
    ubuntu core developer i speak de, en

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael Tokarev@21:1/5 to All on Thu Jan 2 15:20:01 2025
    02.01.2025 03:00, Aurélien COUDERC wrote:

    Sure but I wouldn’t know how to do that since I’m calling apt and force-unsafe-io seems to be a dpkg option ?

    echo force-unsafe-io > /etc/dpkg/dpkg.conf.d/unsafeio

    before upgrade.

    /mjt

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ángel@21:1/5 to Michael Tokarev on Fri Jan 3 00:50:01 2025
    On 2025-01-02 at 17:11 +0300, Michael Tokarev wrote:
    02.01.2025 03:00, Aurélien COUDERC wrote:

    Sure but I wouldn’t know how to do that since I’m calling apt and force-unsafe-io seems to be a dpkg option ?

    echo force-unsafe-io > /etc/dpkg/dpkg.conf.d/unsafeio

    before upgrade.

    /mjt

    Beware: this should actually be

    echo force-unsafe-io > /etc/dpkg/dpkg.cfg.d/unsafeio

    :)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Geert Stappers@21:1/5 to All on Fri Dec 27 13:40:01 2024
    On Fri, Dec 27, 2024 at 03:19:12PM +0300, Hakan Bayındır wrote:
    On 12/27/24 11:18 AM, Julian Andres Klode wrote:
    On Tue, Dec 24, 2024 at 11:10:27PM +0900, Simon Richter wrote:
    ....

    So no, we cannot drop the fsync(). :\

    I do have a plan, namely to merge the btrfs snapshot integration into apt:
    if we took a snapshot, we run dpkg with --force-unsafe-io.

    The cool solution would be to take the snapshot, run dpkg inside it,
    and then switch to it, but one step after the other; that's still very
    much WIP.

    Hi Julian,

    Hello All,


    How would that work for non-BTRFS systems, and if not, will that make Debian a BTRFS-only system?

    I'm personally fine with "This works faster in BTRFS, because we implemented X", but not with "Debian only works on BTRFS".

    Yeah, it feels wrong that dpkg gets file system code, gets code for one particular file system.

    Most likely I don't understand the proposal of Julian
    and hope for further information.


    Cheers,
    Hakan

    @Hakan, please make reading in the discussion order possible.


    Groeten
    Geert Stappers
    --
    Silence is hard to parse

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Jonathan Kamens@21:1/5 to All on Fri Dec 27 19:10:01 2024
    On 12/27/24 7:34 AM, Geert Stappers wrote:
    Yeah, it feels wrong that dpkg gets file system code, gets code for one
    particular file system.

    I disagree. If there is a significant optimization that dpkg can
    implement that is only available for btrfs, and if enough people use
    btrfs that there would be significant communal benefit in that
    optimization being implemented, and if it is easiest to implement the
    optimization within dpkg as seems to be the case here (indeed, it may
    /only/ be possible to implement the optimization within dpkg), then it
    is perfectly reasonable to implement the optimization in dpkg. Dpkg is
    a low-level OS-level utility, it is entirely reasonable for it to have
    OS-level optimizations implemented within it.

    jik

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Hakan Bayındır@21:1/5 to All on Fri Dec 27 21:30:02 2024
    On 27 Dec 2024, at 20:46, Jonathan Kamens <jik@kamens.us> wrote:

    On 12/27/24 7:34 AM, Geert Stappers wrote:
    Yeah, it feels wrong that dpkg gets file system code, gets code for one
    particular file system.

    I disagree. If there is a significant optimization that dpkg can implement that is only available for btrfs, and if enough people use btrfs that there would be significant communal benefit in that optimization being implemented, and if it is easiest to
    implement the optimization within dpkg as seems to be the case here (indeed, it may only be possible to implement the optimization within dpkg), then it is perfectly reasonable to implement the optimization in dpkg. Dpkg is a low-level OS-level utility,
    it is entirely reasonable for it to have OS-level optimizations implemented within it.

    I’m in the same boat as Geert. I don’t think dpkg is the correct place to integrate fs-specific code. It smells like a clear boundary/responsibility violation and opens a big can of worms for the future of dpkg.

    Maybe a wrapper (or, more appropriately, a pre/post hook) around dpkg which takes these snapshots, calls dpkg with the appropriate switches and makes the switch could be implemented, but integrating it into dpkg doesn’t make sense.

    In the ideal case even that shouldn’t be done, because preferential treatment and the proliferation of edge cases make maintenance very hard and unpleasant, and dpkg is critical infrastructure for Debian.

    Cheers,

    Hakan


    jik


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Aurélien COUDERC@21:1/5 to All on Sat Dec 28 00:20:01 2024
    On 27 December 2024 18:46:02 GMT+01:00, Jonathan Kamens <jik@kamens.us> wrote:
    On 12/27/24 7:34 AM, Geert Stappers wrote:
    Yeah, it feels wrong that dpkg gets file system code, gets code for one
    particular file system.

    I disagree. If there is a significant optimization that dpkg can implement that is only available for btrfs,

    Julian was talking about apt, but that doesn't fundamentally change the argument.

    and if enough people use btrfs that there would be significant communal benefit in that optimization being implemented, and if it is easiest to implement the optimization within dpkg as seems to be the case here (indeed, it may /only/ be possible to
    implement the optimization within dpkg), then it is perfectly reasonable to implement the optimization in dpkg.

    Totally agreed: yes it would be extremely useful to have some snapshotting feature for apt operations, and no we're never going to get there if we wait for every single filesystem on every kernel to implement it. So if this has to start with btrfs then… great news and super cool!


    --
    Aurélien

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Marc Haber@21:1/5 to coucouf@debian.org on Sat Dec 28 10:50:01 2024
    On Sat, 28 Dec 2024 00:13:02 +0100, Aurélien COUDERC
    <coucouf@debian.org> wrote:
    Totally agreed: yes it would be extremely useful to have some snapshotting feature for apt operations, and no we're never going to get there if we wait for every single filesystem on every kernel to implement it. So if this has to start with btrfs then… great news and super cool!

    Do we have data about how many of our installations would be eligible
    to profit from this new invention? I might think it would be better to
    spend time on features that all users benefit from (in the case of
    dpkg, it would be for example a big overhaul of the conffile handling
    code). But on the other hand, even our dpkg and apt developers are
    doing splendid work and are still volunteers, so I think they'd get to
    choose what they do with their babies.

    If we had Technical Leadership, things would be different, but we
    deliberately chose to have not.

    Greetings
    Marc
    --
    ----------------------------------------------------------------------------
    Marc Haber         |   " Questions are the         | Mailadresse im Header
    Rhein-Neckar, DE   |     Beginning of Wisdom "     |
    Nordisch by Nature | Lt. Worf, TNG "Rightful Heir" | Fon: *49 6224 1600402

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael Stone@21:1/5 to Bernhard Schmidt on Fri Jan 3 17:10:01 2025
    On Fri, Jan 03, 2025 at 11:49:05AM +0100, Bernhard Schmidt wrote:
    Shared infrastructure of course. Note that this includes an update of
    the initramfs, which is CPU bound and takes a bit on this system. You
    can take around 45s off the clock for the initramfs regeneration in
    each run. I did a couple of runs and the results were pretty
    consistent.

    This tracks with my experience: optimizing initramfs creation would
    produce *far* more bang for the buck than fiddling with dpkg fsyncs... especially since we tend to do that repeatedly on any major upgrade. :(

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bálint Réczey@21:1/5 to All on Fri Jan 3 19:50:01 2025
    Hi,

    Michael Stone <mstone@debian.org> wrote (on Fri, Jan 3, 2025, 17:07):

    On Fri, Jan 03, 2025 at 11:49:05AM +0100, Bernhard Schmidt wrote:
    Shared infrastructure of course. Note that this includes an update of
    the initramfs, which is CPU bound and takes a bit on this system. You
    can take around 45s off the clock for the initramfs regeneration in
    each run. I did a couple of runs and the results were pretty
    consistent.

    This tracks with my experience: optimizing initramfs creation would
    produce *far* more bang for the buck than fiddling with dpkg fsyncs... especially since we tend to do that repeatedly on any major upgrade. :(

    Well, that depends on the system configuration and on whether the
    upgrade triggers initramfs updates.
    OTOH 45s seems quite slow. Bernhard, do you have zstd installed and initramfs-tools configured to use it?
    On my laptop 3 kernels are installed and one initramfs update round takes ~10s:

    rbalint@nano:~$ grep -m 1 "model name" /proc/cpuinfo
    model name : 11th Gen Intel(R) Core(TM) i7-1160G7 @ 1.20GHz
    rbalint@nano:~$ grep COMPRESS /etc/initramfs-tools/initramfs.conf ;
    sudo time update-initramfs -k all -u
    # COMPRESS: [ gzip | bzip2 | lz4 | lzma | lzop | xz | zstd ]
    COMPRESS=zstd
    ...
    update-initramfs: Generating /boot/initrd.img-6.8.0-51-generic
    ... (2 more kernels)
    5.63user 5.35system 0:10.48elapsed 104%CPU (0avgtext+0avgdata 29540maxresident)k
    541534inputs+540912outputs (247major+1241330minor)pagefaults 0swaps

    If I switch to gzip, the initramfs update takes ~19s:
    rbalint@nano:~$ grep COMPRESS /etc/initramfs-tools/initramfs.conf ;
    sudo time update-initramfs -k all -u
    # COMPRESS: [ gzip | bzip2 | lz4 | lzma | lzop | xz | zstd ]
    COMPRESS=gzip
    # COMPRESSLEVEL: ...
    # COMPRESSLEVEL=1
    update-initramfs: Generating /boot/initrd.img-6.8.0-51-generic
    ... (2 more kernels)
    10.84user 8.31system 0:18.78elapsed 101%CPU (0avgtext+0avgdata 29556maxresident)k
    541502inputs+530160outputs (246major+1225801minor)pagefaults 0swaps

    Cheers,
    Balint

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)