• when will we rebuild AI-based software from sources/datasets?

    From Stefano Zacchiroli@21:1/5 to M. Zhou on Sat Feb 8 15:10:01 2025
    Hello Mo, all, I've now read through the full GR text and commentary.
    I have a bunch of comments, but I'll post them separately (and/or in MR).
    In this mail I'd like to focus on one important aspect of its practical
    implications:

    On Sun, Feb 02, 2025 at 12:56:59AM -0500, M. Zhou wrote:
    (2) are the options clear enough for a vote? Considering that lots of the
    readers may not be familiar with how AI is created, I tried to explain it,
    as well as the implications if some components are missing.

    I'd like to understand, in case option A passes, when and how Debian
    will rebuild, from their "source" (which, as per option A, will include
    the full training dataset), the AI models that are included in some free
    software shipped in the archive.


    Concrete examples
    -----------------

    Let's take two simple examples that, size-wise, could fit in the Debian
    archive together with training pipelines and datasets.

    First, let's take a "small" image classification model that one day
    might be included in Digikam or similar free software. Let's say the
    trained model is ~1 GiB (like Moondream [1] today) and that the training
    dataset is ~10 GiB (I've no idea if the Moondream training dataset is
    open data and I'm probably being very conservative with its size here;
    just assume it is correct for now).

    [1]: https://ollama.com/library/moondream:v2/blobs/e554c6b9de01

    For a second, even smaller example, let's consider gnubg (GNU
    backgammon), which today contains, in the Debian archive, a trained
    neural network [2] of less than 1 MiB. Its training data is *not* in the
    archive, but is available online without a license (AFAICT) [3] and
    weighs about 80 MiB. The training code is available as well [4], even
    though it is still in Python 2.

    [2]: https://git.savannah.gnu.org/cgit/gnubg.git/log/gnubg.weights
    [3]: https://alpha.gnu.org/gnu/gnubg/nn-training
    [4]: https://git.savannah.gnu.org/cgit/gnubg/gnubg-nn.git


    What do we put where?
    ---------------------

    Regarding source packages, I suspect that most of our upstream authors
    who end up using free AI will *not* include training datasets in the
    distribution tarballs or Git repositories of the main software. So what
    will we do downstream? Do we repack source packages to include the
    training datasets? Do we create *separate* source packages for the
    training datasets? Do we create a separate (ftp? git-annex? git-lfs?)
    hosting place to host large training datasets, to avoid exploding
    mirror sizes? Do we simply refer to external hosting places that are not
    under Debian control?

    Of course this would be a non-issue in the gnubg case (we can just
    store everything in the source package), but it will become more
    significant in the Digikam case, and the number of such cases will
    probably increase over time (see the sketch below for what a separate
    "training data" source package could look like).
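
    For concreteness, here is a minimal debian/control sketch of what a
    *separate* source package for a training dataset might look like. Every
    name in it is hypothetical, and a dedicated archive section (or
    component) for such packages is itself one of the open questions above:

        # Hypothetical sketch only: package names, the maintainer, and the
        # "data" section are all made up for illustration.
        Source: digikam-classifier-training-data
        Section: data
        Priority: optional
        Maintainer: Some Maintainer <someone@example.org>
        Build-Depends: debhelper-compat (= 13)
        Standards-Version: 4.7.0

        Package: digikam-classifier-training-data
        Architecture: all
        Description: training dataset for a (hypothetical) Digikam image classifier
         Raw images and labels needed to retrain the bundled classification
         model from scratch; only required to rebuild the model, not to run it.

    Whether such a package lives on the regular mirrors, in a dedicated
    archive area, or only as a pointer to external storage is exactly the
    trade-off at stake here.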

    (Note that I do not have definitive answers to any of the questions in
    this email. And also that I'm *not* raising them as counter arguments to
    option A, which is my favorite one at the moment. I just want to make
    sure that we have a rough idea of how we will in practice *implement*
    option A in a way that fits Debian processes, rather than thinking about
    them only after the vote. We have been there for a number of GRs in the
    past, and it has not been fun.)

    Regarding binary packages, the question applies to large trained AI
    models too. On this front, we can rely entirely on upstream software to
    "unbundle" trained models from the software itself, so that they are
    downloaded on the fly on user machines and never enter the Debian
    archive. Based on previous answers, I suspect this might be what Mo has
    in mind. But I don't find it very satisfactory, for a number of reasons:
    it will not be universal, we might end up having to host some of the
    large models ourselves at some point, and even when it is handled
    upstream we will leave users on their own in terms of installation
    risks, etc. (Yes, this is not a new problem, and it applies to other
    software that automatically downloads plugins and whatnot, but I still
    don't like it.)


    When do we retrain?
    -------------------

    The most difficult question for me is: when do we retrain AI models
    (shipped in Debian binary packages) from their training datasets
    (shipped in source packages)?

    In some cases, as pointed out in the GR commentary, it will be
    computationally impossible for Debian to do so. We can mostly ignore
    these cases, but this raises the question: do we want to ship in
    Debian trained AI models that *allegedly* have all of their training
    dataset and pipeline available under free licenses, if we cannot
    verify/rebuild them ourselves? (If this smells like the XZ Utils hack to
    you, you're not alone!)

    Let's focus now on the cases that *could* be retrained by Debian,
    possibly after buying a dozen GPUs to put on dedicated buildds.
    Potential answers on when to retrain are:

    - We never retrain.

    We are now back to the already discussed XZ Utils smell. I don't think
    this would be wise/acceptable in cases where it is feasible for us to
    retrain.

    - We retrain at every package build.

    Why not, but it will be quite expensive. (Add here your favorite
    environmental concerns.) It will also require some dedicated
    scheduling to separate packages that need GPUs to build from the others.
    This will probably result in a natural separation: source packages
    containing training datasets, which can be tagged as needing GPUs to
    build, producing binary packages on which the final user-installed
    software depends. Seems appealing to me.

    - We retrain every now and then (e.g., once per release).

    A compromise between the previous two. It can be analogous to
    bootstrapping compilers, which we don't do systematically, but which
    can be done by motivated developers and porters. If we go down this
    path, we probably want to standardize some debian/rules target
    ("bootstrap"?) that recreates trained models from sources and can then
    be used to update the source packages that ship them (see the sketch
    below).
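
    Purely as an illustration of that idea, a debian/rules fragment could
    look like the following; the "bootstrap" target name, the training
    script, and all paths are hypothetical, not an agreed-upon convention
    (recipe lines are tab-indented, as in any makefile):

        # debian/rules fragment -- illustrative sketch only.
        bootstrap:
        	# Retrain the model from the training dataset shipped in the
        	# source package; this typically needs a GPU build machine and
        	# is NOT run as part of a normal package build.
        	python3 training/train.py \
        	    --dataset training-data/ \
        	    --output data/model.weights

        .PHONY: bootstrap

    A motivated developer or porter could then run this target, commit the
    refreshed weights, and upload the updated source package.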


    Reproducible builds
    -------------------

    Side, but important, consideration: retraining will in most cases not be
    bitwise reproducible, as already pointed out in the GR commentary. The
    practical consequence for Debian is that all packages that end up
    containing the logic for retraining AI models will remain
    non-bitwise-reproducible for the foreseeable future. (Which is an
    additional good argument for clearly separating those packages from the
    others.)


    Have I missed any other specific Debian process that will be impacted?

    Cheers
    --
    Stefano Zacchiroli . zack@upsilon.cc . https://upsilon.cc/zack
    Full professor of Computer Science . Télécom Paris, Polytechnic Institute of Paris
    Co-founder & CSO Software Heritage . Mastodon: https://mastodon.xyz/@zacchiro

  • From Jonas Smedegaard@21:1/5 to All on Sat Feb 8 16:40:01 2025
    Quoting Stefano Zacchiroli (2025-02-08 14:57:18)
    Concrete examples
    -----------------

    [...]

    Another example of seemingly "small" training data is Tesseract. The
    DFSG status of its training data is tracked at
    https://bugs.debian.org/699609 with an optimistic view. A more
    pessimistic view is suggested by an upstream mention of the training
    data being "all the WWW", while other comments mention the involvement
    of non-free fonts:
    https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951

    - Jonas

    --
    * Jonas Smedegaard - idealist & Internet-arkitekt
    * Tlf.: +45 40843136 Website: http://dr.jones.dk/
    * Sponsorship: https://ko-fi.com/drjones

    [x] quote me freely [ ] ask before reusing [ ] keep private
  • From Yaroslav Halchenko@21:1/5 to Gerardo Ballabio on Mon Feb 10 16:50:02 2025
    On Mon, 10 Feb 2025, Gerardo Ballabio wrote:
    Stefano Zacchiroli wrote:
    [...] Do we create *separate* source packages for the training
    datasets? Do we create a separate (ftp? git-annex? git-lfs?) hosting
    place to host large training datasets, to avoid exploding mirror sizes?

    I'd suggest separate source packages, *and* putting them in a special
    section of the archive that mirrors may choose not to host.

    I'm not sure whether there could also be technical problems with
    many-gigabytes-sized packages, e.g., is there an upper limit to file
    size that could be hit? Can the package download be resumed if it is
    interrupted? Might tar fail to unpack the package? (Although those
    could all be solved by splitting the package into chunks...)

    Just want to chime in in support of using git-annex as an underlying
    technology, and to provide a possible sketch of a solution:

    - git-annex allows for (listing just the few points most relevant here,
      out of a wide range of general features):

      - "linking" into a wide range of data sources and, if needed, creating
        custom "special remotes" to access data.

        https://datasets.datalad.org/ is proof of that -- it provides access
        to 100s of TBs of data from a wide range of hosting solutions (S3,
        tarballs on an HTTP server, some rclone-compatible storage
        solutions, ...).

      - diversifying/tiering data backup and storage, seamlessly to the end
        user.

        To that end, I have (ab)used a claimed-to-be-"unlimited"
        institutional Dropbox to back up over 600 TBs of a public data
        archive, and could then easily announce it "dead" whenever the data
        was no longer available there.

      - separating "data availability" tracking (stored in the git-annex
        branch) from actual version tracking (your "master" branch).

        This way, adjusting data availability in no way requires changes to
        your "versioned data release".

    - similarly to how we have https://neuro.debian.net/debian/dists/data/
      for "classical" Debian packages, there could be a similar
      multi-version suite in Debian (multiple versions of a package allowed
      within the same suite), with packages that deploy a git-annex
      repository upon installation. Then individual Debian suites (stable,
      unstable) would rely on using specific version(s) of packages from
      that "data" suite.

    - a "data source package" could be just a prescription on how to
      establish data access, lean and nice. The "binary package" would also
      be relatively lean, since the data itself is accessible via git-annex.

    - a separate service could observe/verify the continued availability of
      the data and, when necessary, establish

    FWIW -- https://datasets.datalad.org could be considered a "single
    package", as it is a single git repository leading to the next tier of
    git submodules, overall reaching into the thousands of them.

    But, logically, a separate git repo could be equated with a separate
    Debian package. Additionally, "flavors" of packages could subset the
    types of files to retrieve: e.g., something like openclipart-png could
    depend on openclipart-annex, which would just install the git-annex
    repo, and the -png flavor would then fetch only the *.png files (see
    the sketch below).

    Access to individual files is orchestrated via git-annex, which already
    has built-in mechanisms for data integrity validation (often "on the
    fly" while downloading), retries, stall detection, etc.

    Cheers,
    --
    Yaroslav O. Halchenko
    Center for Open Neuroscience http://centerforopenneuroscience.org
    Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
    WWW: http://www.linkedin.com/in/yarik

