• when will we rebuild AI-based software from sources/datasets?

    From Stefano Zacchiroli@21:1/5 to M. Zhou on Sat Feb 8 15:10:01 2025
    Hello Mo, all, I've now read through the full GR text and commentary.
    I have a bunch of comments, but I'll post them separately (and/or in MR).
    In this mail I'd like to focus on one important aspect of its practical
    implications:

    On Sun, Feb 02, 2025 at 12:56:59AM -0500, M. Zhou wrote:
    (2) are the options clear enough for a vote? Considering that lots of the
    readers may not be familiar with how AI is created, I tried to explain it,
    as well as the implications if some components are missing.

    I'd like to understand, in case option A passes, when and how Debian
    will rebuild, from their "source" (which, as per option A, will include
    the full training dataset), the AI models that are included in some free
    software shipped in the archive.


    Concrete examples
    -----------------

    Let's take two simple examples that, size-wise, could fit in the Debian
    archive together with training pipelines and datasets.

    First, let's take a "small" image classification model that one day
    might be included in Digikam or similar free software. Let's say the
    trained model is ~1 GiB (like Moondream [1] today) and that the training
    dataset is ~10 GiB (I've no idea if the Moondream training dataset is
    open data and I'm probably being very conservative with its size here;
    just assume it is correct for now).

    [1]: https://ollama.com/library/moondream:v2/blobs/e554c6b9de01

    For a second, even smaller example, let's consider gnubg (GNU
    backgammon), which today contains, in the Debian archive, a trained
    neural network [2] of less than 1 MiB. Its training data is *not* in the
    archive, but is available online without a license (AFAICT) [3] and
    weighs about 80 MiB. The training code is available as well [4], even
    though it is still in Python 2.

    [2]: https://git.savannah.gnu.org/cgit/gnubg.git/log/gnubg.weights
    [3]: https://alpha.gnu.org/gnu/gnubg/nn-training
    [4]: https://git.savannah.gnu.org/cgit/gnubg/gnubg-nn.git


    What do we put where?
    ---------------------

    Regarding source packages, I suspect that most of our upstream authors
    who end up using free AI will *not* include training datasets in the
    distribution tarballs or Git repositories of the main software. So what
    will we do downstream? Do we repack source packages to include the
    training datasets? Do we create *separate* source packages for the
    training datasets? Do we create a separate (ftp? git-annex? git-lfs?)
    hosting place to host large training datasets, to avoid exploding
    mirror sizes? Do we simply refer to external hosting places that are not
    under Debian control?

    Of course this would be a non-issue in the gnubg case (we can just
    store everything in the source package), but it will become more
    significant in the Digikam case, and the number of such cases will
    probably increase over time (see the sketch below for what a separate
    "training data" source package could look like).
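
    For concreteness, here is a minimal debian/control sketch of what a
    *separate* source package for a training dataset might look like. Every
    name in it is hypothetical, and a dedicated archive section (or
    component) for such packages is itself one of the open questions above:

        # Hypothetical sketch only: package names, the maintainer, and the
        # "data" section are all made up for illustration.
        Source: digikam-classifier-training-data
        Section: data
        Priority: optional
        Maintainer: Some Maintainer <someone@example.org>
        Build-Depends: debhelper-compat (= 13)
        Standards-Version: 4.7.0

        Package: digikam-classifier-training-data
        Architecture: all
        Description: training dataset for a (hypothetical) Digikam image classifier
         Raw images and labels needed to retrain the bundled classification
         model from scratch; only required to rebuild the model, not to run it.

    Whether such a package lives on the regular mirrors, in a dedicated
    archive area, or only as a pointer to external storage is exactly the
    trade-off at stake here.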

    (Note that I do not have definitive answers to any of the questions in
    this email. And also that I'm *not* raising them as counter arguments to
    option A, which is my favorite one at the moment. I just want to make
    sure that we have a rough idea of how we will in practice *implement*
    option A in a way that fits Debian processes, rather than thinking about
    them only after the vote. We have been there for a number of GRs in the
    past, and it has not been fun.)

    Regarding binary packages, the question applies to large trained AI
    models too. On this front, we can rely entirely on upstream software to
    "unbundle" trained models from the software itself, so that they are
    downloaded on the fly on user machines and never enter the Debian
    archive. Based on previous answers, I suspect this might be what Mo has
    in mind. But I don't find it very satisfactory, for a number of reasons:
    it will not be universal, we might end up having to host some of the
    large models ourselves at some point, and even when it is handled
    upstream we will leave users on their own in terms of installation
    risks, etc. (Yes, this is not a new problem, and it applies to other
    software that automatically downloads plugins and whatnot, but I still
    don't like it.)


    When do we retrain?
    -------------------

    The most difficult question for me is: when do we retrain AI models
    (shipped in Debian binary packages) from their training datasets
    (shipped in source packages)?

    In some cases, as pointed out in the GR commentary, it will be
    computationally impossible for Debian to do so. We can mostly ignore
    these cases, but this raises the question: do we want to ship in
    Debian trained AI models that *allegedly* have all of their training
    dataset and pipeline available under free licenses, if we cannot
    verify/rebuild them ourselves? (If this smells like the XZ Utils hack to
    you, you're not alone!)

    Let's focus now on the cases that *could* be retrained by Debian,
    possibly after buying a dozen GPUs to put on dedicated buildds.
    Potential answers on when to retrain are:

    - We never retrain.

    We are now back to the already discussed XZ Utils smell. I don't think
    this would be wise/acceptable in cases where it is feasible for us to
    retrain.

    - We retrain at every package build.

    Why not, but it will be quite expensive. (Add here your favorite
    environmental concerns.) It will also require some dedicated
    scheduling to separate packages that need GPUs to build from the others.
    This will probably result in a natural separation: source packages
    containing training datasets, which can be tagged as needing GPUs to
    build, producing binary packages on which the final user-installed
    software depends. Seems appealing to me.

    - We retrain every now and then (e.g., once per release).

    A compromise between the previous two. It can be analogous to
    bootstrapping compilers, which we don't do systematically, but which
    can be done by motivated developers and porters. If we go down this
    path, we probably want to standardize some debian/rules target
    ("bootstrap"?) that recreates trained models from sources and can then
    be used to update the source packages that ship them (see the sketch
    below).
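
    Purely as an illustration of that idea, a debian/rules fragment could
    look like the following; the "bootstrap" target name, the training
    script, and all paths are hypothetical, not an agreed-upon convention
    (recipe lines are tab-indented, as in any makefile):

        # debian/rules fragment -- illustrative sketch only.
        bootstrap:
        	# Retrain the model from the training dataset shipped in the
        	# source package; this typically needs a GPU build machine and
        	# is NOT run as part of a normal package build.
        	python3 training/train.py \
        	    --dataset training-data/ \
        	    --output data/model.weights

        .PHONY: bootstrap

    A motivated developer or porter could then run this target, commit the
    refreshed weights, and upload the updated source package.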


    Reproducible builds
    -------------------

    Side, but important, consideration: retraining will in most cases not be
    bitwise reproducible, as already pointed out in the GR commentary. The
    practical consequence for Debian is that all packages that end up
    containing the logic for retraining AI models will remain
    non-bitwise-reproducible for the foreseeable future. (Which is an
    additional good argument for clearly separating those packages from the
    others.)


    Have I missed any other specific Debian process that will be impacted?

    Cheers
    --
    Stefano Zacchiroli . zack@upsilon.cc . https://upsilon.cc/zack
    Full professor of Computer Science . Télécom Paris, Polytechnic Institute of Paris
    Co-founder & CSO Software Heritage . Mastodon: https://mastodon.xyz/@zacchiro

  • From Jonas Smedegaard@21:1/5 to All on Sat Feb 8 16:40:01 2025
    Quoting Stefano Zacchiroli (2025-02-08 14:57:18)
    Concrete examples
    -----------------

    [...]

    Another example of seemingly "small" training data is Tesseract. The
    DFSG status of its training data is tracked at
    https://bugs.debian.org/699609 with an optimistic view. A more
    pessimistic view is suggested by an upstream mention of the training
    data being "all the WWW", while other comments mention the involvement
    of non-free fonts:
    https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951

    - Jonas

    --
    * Jonas Smedegaard - idealist & Internet-arkitekt
    * Tlf.: +45 40843136 Website: http://dr.jones.dk/
    * Sponsorship: https://ko-fi.com/drjones

    [x] quote me freely [ ] ask before reusing [ ] keep private
  • From Yaroslav Halchenko@21:1/5 to Gerardo Ballabio on Mon Feb 10 16:50:02 2025
    On Mon, 10 Feb 2025, Gerardo Ballabio wrote:
    Stefano Zacchiroli wrote:
    [...] Do we create *separate* source packages for the training
    datasets? Do we create a separate (ftp? git-annex? git-lfs?) hosting
    place to host large training datasets, to avoid exploding mirror sizes?

    I'd suggest separate source packages, *and* putting them in a special
    section of the archive that mirrors may choose not to host.

    I'm not sure whether there could also be technical problems with
    many-gigabytes-sized packages, e.g., is there an upper limit to file
    size that could be hit? Can the package download be resumed if it is
    interrupted? Might tar fail to unpack the package? (Although those
    could all be solved by splitting the package into chunks...)

    Just want to chime in in support of using git-annex as an underlying
    technology, and to provide a possible sketch of a solution:

    - git-annex allows for (listing just the few points most relevant here,
      out of a wide range of general features):

      - "linking" into a wide range of data sources and, if needed, creating
        custom "special remotes" to access data.

        https://datasets.datalad.org/ is proof of that -- it provides access
        to 100s of TBs of data from a wide range of hosting solutions (S3,
        tarballs on an HTTP server, some rclone-compatible storage
        solutions, ...).

      - diversifying/tiering data backup and storage, seamlessly to the end
        user.

        To that end, I have (ab)used a claimed-to-be-"unlimited"
        institutional Dropbox to back up over 600 TBs of a public data
        archive, and could then easily announce it "dead" whenever the data
        was no longer available there.

      - separating "data availability" tracking (stored in the git-annex
        branch) from actual version tracking (your "master" branch).

        This way, adjusting data availability in no way requires changes to
        your "versioned data release".

    - similarly to how we have https://neuro.debian.net/debian/dists/data/
      for "classical" Debian packages, there could be a similar
      multi-version suite in Debian (multiple versions of a package allowed
      within the same suite), with packages that deploy a git-annex
      repository upon installation. Then individual Debian suites (stable,
      unstable) would rely on using specific version(s) of packages from
      that "data" suite.

    - a "data source package" could be just a prescription on how to
      establish data access, lean and nice. The "binary package" would also
      be relatively lean, since the data itself is accessible via git-annex.

    - a separate service could observe/verify the continued availability of
      the data and, when necessary, establish

    FWIW -- https://datasets.datalad.org could be considered a "single
    package", as it is a single git repository leading to the next tier of
    git submodules, overall reaching into the thousands of them.

    But, logically, a separate git repo could be equated with a separate
    Debian package. Additionally, "flavors" of packages could subset the
    types of files to retrieve: e.g., something like openclipart-png could
    depend on openclipart-annex, which would just install the git-annex
    repo, and the -png flavor would then fetch only the *.png files (see
    the sketch below).

    Access to individual files is orchestrated via git-annex, which already
    has built-in mechanisms for data integrity validation (often "on the
    fly" while downloading), retries, stall detection, etc.

    Cheers,
    --
    Yaroslav O. Halchenko
    Center for Open Neuroscience http://centerforopenneuroscience.org
    Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
    WWW: http://www.linkedin.com/in/yarik

