• Re: performance regressions in 15.0 [The Windows Dev Kit 2023 buildworld took about 6 minutes less time for jemalloc 5.3.0, not more, for non-debug contexts]

    From Mark Millard@marklmi@yahoo.com to muc.lists.freebsd.stable on Mon Dec 8 09:23:52 2025
    From Newsgroup: muc.lists.freebsd.stable

    On Dec 8, 2025, at 04:46, Mateusz Guzik <mjguzik@gmail.com> wrote:
    On Sun, Dec 7, 2025 at 5:19 PM Mark Millard <marklmi@yahoo.com> wrote:

    On Dec 6, 2025, at 19:03, Mark Millard <marklmi@yahoo.com> wrote:

    On Dec 6, 2025, at 14:25, Warner Losh <imp@bsdimp.com> wrote:

    On Sat, Dec 6, 2025, 3:06 PM Mark Millard <marklmi@yahoo.com> wrote:
    On Dec 6, 2025, at 06:14, Mark Millard <marklmi@yahoo.com> wrote:

    Mateusz Guzik <mjguzik_at_gmail.com> wrote on
    Date: Sat, 06 Dec 2025 10:50:08 UTC :

    I got pointed at phoronix: https://www.phoronix.com/review/freebsd-15-amd-epyc

    While I don't treat their results as gospel, a FreeBSD vs FreeBSD test showing a slowdown most definitely warrants a closer look.

    They observed slowdowns when using iperf over localhost and when compiling llvm.

    I can confirm both problems and more.
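    For anyone wanting to repeat the localhost iperf comparison, a minimal
    sketch, assuming iperf3 from packages on each system being compared:

    # start the server as a daemon, then run a client against loopback
    iperf3 -s -D
    iperf3 -c 127.0.0.1 -t 30
    # repeat on each OS version and compare the reported throughput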

    I found the profiling tooling for userspace to be broken again so I did not investigate much and I'm not going to dig into it further.
    Test box is AMD EPYC 9454 48-Core Processor, with the 2 systems
    running as 8 core vms under kvm.
    . . .



    Both of the below are from ampere3 (aarch64) instead, its
    2 most recent "bulk -a" runs that completed, elapsed times
    shown for qt6-webengine-6.9.3 builds:

    150releng-arm64-quarterly qt6-webengine-6.9.3 53:33:46
    135arm64-default qt6-webengine-6.9.3 38:43:36
    A somewhat better comparison is now available from the
    active builds, here quarterly 14.3 to match with the
    quarterly 15.0 . . . https://pkg-status.freebsd.org/ampere1/data/143arm64-quarterly/1081574d367d/logs/qt6-webengine-6.9.3.log
    shows 14.3 quarterly getting the qt6-webengine-6.9.3
    build timing: 38:25:51
    on ampere1 with:
    Host OSVERSION: 1600004
    Jail OSVERSION: 1403000
    15.0 is definitely the large one.
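    (For scale: 53:33:46 is 192,826 seconds and 38:43:36 is 139,416 seconds,
    a ratio of about 1.38; against the 14.3 figure of 38:25:51, 138,351
    seconds, the ratio is about 1.39.)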
    As far as I know, ampere1 and ampere3 match in their hardware configurations. (Not that such information is public, so I do not have strong evidence.)
    Given the similarity to 135arm64-default, I will generally
    not switch to referencing 14.3's timing below, leaving that
    implicit.
    For reference:

    Host OSVERSION: 1600000
    Jail OSVERSION: 1500068

    vs.

    Host OSVERSION: 1600000
    Jail OSVERSION: 1305000

    The difference for the above is in the Jail's world builds,
    not in the boot's (kernel+world) builds.


    For reference:


    https://pkg-status.freebsd.org/ampere3/build.html?mastername=150releng-arm64-quarterly&build=88084f9163ae

    build of www/qt6-webengine | qt6-webengine-6.9.3 ended at Sun Nov 30 05:40:02 -00 2025
    build time: 2D:05:33:52


    https://pkg-status.freebsd.org/ampere3/build.html?mastername=135arm64-default&build=f5384fe59be6

    build of www/qt6-webengine | qt6-webengine-6.9.3 ended at Sat Nov 22 15:33:34 -00 2025
    build time: 1D:14:43:41


    Expanding the notes to before and after jemalloc 5.3.0
    was merged to main: beefy18 was the main-amd64 builder
    before and somewhat after the jemalloc 5.3.0 merge from
    vendor branch:

    Before: p2650762431ca_s51affb7e971 261:29:13 building 36074 port-packages, start 05 Aug 2025 01:10:59 GMT
    ( jemalloc 5.3.0 merge from vendor branch: 15 Aug 2025)
    After : p9652f95ce8e4_sb45a181a74c 428:49:20 building 36318 port-packages, start 19 Aug 2025 01:30:33 GMT

    (The log files are long gone for port-packages built.)
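    (For scale: 261:29:13 is 941,353 seconds and 428:49:20 is 1,543,760
    seconds, a ratio of about 1.64, while the port-package counts differ by
    under 1%.)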

    main-15 used a debug jail world but 15.0-RELEASE does not.

    I'm not aware of such a port-package builder context for a
    non-debug jail world before and after a jemalloc 5.3.0 merge.

    A few months before I landed the jemalloc patches, I did 4 or 5 from-dirt buildworlds. The elapsed time was, iirc, within 1 or 2%. Enough to maybe see a diff with the small sample size, but not enough for ministat to trigger at 95%. I don't recall keeping the data for this and can't find it now. And I'm not even sure, in hindsight, that I ran a good experiment. It might be related, or not, but it would be easy enough for someone to set up two jails: one just before and one just after. Build the world from scratch (same hash) on both. That would test it since you'd be holding all other variables constant.
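    As a sketch of that kind of comparison, assuming the elapsed seconds of
    repeated buildworld runs had been collected into two plain-text files,
    one value per line (the file names here are just placeholders):

    # ministat(1) is in the base system; -c 95 asks for a 95% confidence level
    ministat -s -w 74 -c 95 old-jemalloc.txt new-jemalloc.txt
    # "No difference proven at 95.0% confidence" in the output means the
    # samples do not show a statistically significant change at that level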

    When we imported the tip of FreeBSD main at work, we didn't get a cpu change trigger from our tests that I recall...


    The range of commits looks like:

    • git: 9a7c512a6149 - main - ucred groups: restore a useful comment Eric van Gyzen
    • git: bf6039f09a30 - main - jemalloc: Unthin contrib/jemalloc Warner Losh
    • git: a0dfba697132 - main - jemalloc: Update jemalloc.xml.in per FreeBSD-diffs Warner Losh
    • git: 718b13ba6c5d - main - jemalloc: Add FreeBSD's updates to jemalloc_preamble.h.in Warner Losh
    • git: 6371645df7b0 - main - jemalloc: Add JEMALLOC_PRIVATE_NAMESPACE for the libc namespace Warner Losh
    • git: da260ab23f26 - main - jemalloc: Only replace _pthread_mutex_init_calloc_cb in private namespace Warner Losh
    • git: c43cad871720 - main - jemalloc: Merge from jemalloc 5.3.0 vendor branch Warner Losh
    • git: 69af14a57c9e - main - jemalloc: Note update in UPDATING and RELNOTES Warner Losh

    I've started a build of a non-debug 9a7c512a6149 world
    to later create a chroot to do a test buildworld in.

    I'll also do a build of a non-debug 69af14a57c9e world
    to later create the other chroot to do a test
    buildworld in.

    non-debug means my use of:

    WITH_MALLOC_PRODUCTION=
    WITHOUT_ASSERT_DEBUG=
    WITHOUT_PTHREADS_ASSERTIONS=
    WITHOUT_LLVM_ASSERTIONS=

    I've used "env WITH_META_MODE=" as it cuts down on the
    volume and frequency of scrolling output. I'll do the
    same later.

    If there is anything you want controlled in a different
    way, let me know.
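    Roughly, the shape of the planned test (the paths here are placeholders
    and the exact setup details may differ):

    # two chroots, each populated from a world built at one of the commits:
    #   /chroots/pre-jemalloc   installed from a 9a7c512a6149 build
    #   /chroots/post-jemalloc  installed from a 69af14a57c9e build
    # both get the same /usr/src checkout (69af14a57c9e) and the same
    # /etc/src.conf with the non-debug knobs listed above; then time the
    # same buildworld in each (repeating runs if feeding ministat):
    /usr/bin/time -h chroot /chroots/pre-jemalloc \
        env WITH_META_MODE= make -C /usr/src -j8 buildworld
    /usr/bin/time -h chroot /chroots/post-jemalloc \
        env WITH_META_MODE= make -C /usr/src -j8 buildworld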

    The Windows Dev Kit 2023 is booted (world and kernel)
    with:

    # uname -apKU
    FreeBSD aarch64-main-pbase 16.0-CURRENT FreeBSD 16.0-CURRENT main-n281922-4872b48b175c GENERIC-NODEBUG arm64 aarch64 1600004 1600004

    which is from an official pkgbase distribution. So the
    boot-world is a debug world but the boot-kernel is not.

    The Windows Dev Kit 2023 will take some time for such
    -j8 builds and I may end up sleeping in the middle of
    the sequence someplace. So it may be a while before
    I've any comparison/contrast data to report.



    Summary for jemalloc for before vs. at 5.3.0
    for *non-debug* contexts doing the buildworld :

    before 5.3.0: 9754 seconds (about 2.7 hrs)
    with 5.3.0: 9384 seconds (about 2.6 hrs)
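    (That is 370 seconds, about 3.8%, or roughly 6 minutes, less with 5.3.0.)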


    While in principle this can accurately reflect the difference, the
    benchmark itself is not valid as is.
    As a reminder, here is what started this for my specific
    messages:
    On ampere3 :
    150releng-arm64-quarterly qt6-webengine-6.9.3 53:33:46
    135arm64-default qt6-webengine-6.9.3 38:43:36
    A fairly large multiplication factor. The test
    was a cross-check on that; at least, that is how I
    interpreted Warner's request, and that was my purpose in
    agreeing to do the test.
    I tried to do what Warner asked. It adds a little data
    to what he reported.
    I do not view the result as indicating much more than
    that the two builds are approximately equal in the time
    taken. I have no reason to care if the timings were swapped,
    for example: same conclusion for the comparison I was
    making.
    It would be highly unlikely for repeated tests to have
    variability reach anywhere near the qt6-webengine-6.9.3
    scale-factor difference.
    First, you can't just run it once -- the result needs to be proven
    repeatable and profiled. For a build of that duration, with this few resources,
    For comparison to:
    150releng-arm64-quarterly qt6-webengine-6.9.3 53:33:46
    135arm64-default qt6-webengine-6.9.3 38:43:36
    and that size of scale factor, I'd say yes, I can,
    given the near equality that I got. It is evidence
    that this type of test misses being relevant,
    other than showing no such systematic scale factor
    for this type of test.
    FYI: 32 GiBytes of RAM. 8 cores that are compatible
    with Cortex-A76 targeting, 4 are X1C and 4 are A78C.
    USB3 in use, with a U.2 1.4 TB Optane as media, via
    an adapter. UFS file system.
    for all I know the real factor was randomness from I/O.
    Not for a change of scale similar to
    53:33:46 vs. 38:43:36 for
    building qt6-webengine-6.9.3, as far as
    I can see.
    That aside, you need a sanitized baseline. From the description it is not
    clear to me at all whether you are doing the build with the clang perf
    regression fixed or not.
    My results indicate, in part, that it is not a
    good way to investigate the 53:33:46 vs.
    38:43:36 for building qt6-webengine-6.9.3 .
    I doubt I need a better baseline for that
    judgment now. I'd need a different type of
    test activity.
    Even that aside, I outlined 3 more regressions:
    - slower binary startup to begin with
    - slower syscalls which fail with an error
    - slower syscall interface in the first place

    Out of these, the first one is most important here.
    Do you expect any combination of those to be a
    significant part of the scale factor difference
    for 53:33:46 vs. 38:43:36 for building
    qt6-webengine-6.9.3 ?
    If I was to work on this,
    I would not claim that we are targeting the same
    issue, even considering Warner's request, which
    added what he was targeting.
    seeing that the question at hand is whether
    the jemalloc update is a problem,
    I think the specifics of the qt6-webengine-6.9.3
    building would need to be the investigative
    context for what was "at hand" for me. In part
    that judgement is based on the test I did, which found
    near equality for jemalloc.
    I would bypass all of the above and
    instead take 14.3 (not stable/14!) as a baseline + jemalloc update on
    top. This eliminates all of the factors other than jemalloc itself.
    I'll note that ampere1 with a 14.3 jail took 38:25:51
    for its build of qt6-webengine-6.9.3 . That scale of
    timing is not specific to 13.5 jail worlds.
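    A rough sketch of the baseline suggested above, using the commit hashes
    listed earlier (the cherry-picks onto releng/14.3 may well need conflict
    resolution, so treat this as illustrative only):

    # start from 14.3 and add only the jemalloc 5.3.0 commits on top
    git clone -b releng/14.3 https://git.freebsd.org/src.git src-143-jemalloc
    cd src-143-jemalloc
    # pick everything after 9a7c512a6149 up to and including 69af14a57c9e,
    # i.e. just the jemalloc commit sequence from main
    git cherry-pick 9a7c512a6149..69af14a57c9e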
    building world also seems a little fishy here and it is not clear to
    me at all what version you have built
    The 9xxx sec timings were both building:
    69af14a57c9e - main - jemalloc: Note update in UPDATING and RELNOTES Warner Losh
    (the end of the jemalloc commit sequence).
    One build was 69af14a57c9e in a chroot rebuilding itself.
    The other built 69af14a57c9e via:
    9a7c512a6149 - main - ucred groups: restore a useful comment Eric van Gyzen (the commit from just before the jemalloc 5.3.0 related commits started)
    The 2 chroots differ just by which jemalloc version
    was in use.
    -- was the new jemalloc thing
    building new jemalloc and old jemalloc building old jemalloc? More
    importantly, I would be worried that some of the build picks up whatever
    jemalloc it finds to use during some of the build.

    I would benchmark this by building a big port (not timing dependencies
    of the port, just the port itself -- maybe even chromium or firefox).
    Using qt6-webengine-6.9.3 would mean using a context known
    to have an issue, at least for aarch64.
    But I cannot take weeks of time for such an activity.
    amd64 is messier to compare official builds for
    because of lack of uniformity across the builder
    machines and each type of build being done on
    its own builder machine: no examples of the same
    machine building both.
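    One low-tech way to time just the port and not its dependencies (the
    port and paths are illustrative; this assumes a ports tree is present
    and would be repeated once per world being compared):

    cd /usr/ports/www/qt6-webengine
    # build/install all dependencies first, untimed
    make depends
    # then time only the port's own build
    /usr/bin/time -h make build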
    That's of course quite a bit of effort and if there is nobody to do
    that (or compatible), imo the pragmatic play is to revert the jemalloc
    update for the time being. This restores the known working state and
    should the update be a good thing it can land for 15.1, maybe fixed
    up.
    150releng-arm64-quarterly on ampere3:
    llvm21-21.1.2 : 21:26:14
    143arm64-quarterly on ampere1:
    llvm21-21.1.2 : 15:24:24
    Again a notable time ratio. (default/latest would
    not be a llvm version match.)
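    (For scale: 21:26:14 is 77,174 seconds and 15:24:24 is 55,464 seconds,
    a ratio of about 1.39, in line with the qt6-webengine-6.9.3 ratio of
    about 1.38.)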
    Some basic looking around does not suggest to me
    that qt6-webengine-6.9.3 is somehow unique for
    having notable timing ratios for quarterly on an
    ampere* .
    But, as of yet, I've no good evidence for blaming
    jemalloc as a major contributor to those timing
    ratios --or for blaming any other specific part
    of 15.0 .
    ===
    Mark Millard
    marklmi at yahoo.com
  • From Mark Millard@marklmi@yahoo.com to muc.lists.freebsd.stable on Tue Dec 9 12:32:46 2025
    From Newsgroup: muc.lists.freebsd.stable

    On Dec 9, 2025, at 07:22, Rozhuk Ivan <rozhuk.im@gmail.com> wrote:
    On Mon, 8 Dec 2025 09:23:52 -0800
    Mark Millard <marklmi@yahoo.com> wrote:

    But, as of yet, I've no good evidence for blaming
    jemalloc as a major contributor to those timing
    ratios --or for blaming any other specific part
    of 15.0 .

    If you want to bench jemalloc - there are other ways to do that without building something.
    Try to find some synthetic benchmarks.
    Also, jemalloc can be built without an OS rebuild and linked with a bench.

    These 2 things can reduce the time to test, but they will eliminate OS integration factors.
    Running the same bench on different OSes may give more info.
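    A sketch of what that could look like, building jemalloc 5.3.0 standalone
    and preloading it under a malloc-heavy benchmark (./malloc-bench is a
    placeholder for whatever benchmark is used; the library name may vary):

    git clone https://github.com/jemalloc/jemalloc.git
    cd jemalloc && git checkout 5.3.0
    ./autogen.sh && ./configure && make -j8
    # run the benchmark once with the freshly built jemalloc preloaded ...
    LD_PRELOAD=$PWD/lib/libjemalloc.so.2 ../malloc-bench
    # ... and once without LD_PRELOAD (the system allocator) for comparison
    ../malloc-bench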

    [I've eliminated direct Email to most everyone
    for this reply. There is not even minor new
    technical content.]
    At this point I'm more likely to explore whether I
    get ratios similar to what ampere[13] show for some
    port-package builds that have the large ratios on
    ampere[13]. There are examples that are not as
    time consuming overall for ampere[13] as what I've
    already referenced (but are still non-trivial for
    the time taken). As things stand, I do not have a good
    reproduce-the-issue context, much less one with
    build time frames I'd be willing to deal with in
    my environment.
    ===
    Mark Millard
    marklmi at yahoo.com