Hi,
I would be interested in per-package-and-version download statistics and
trends as well.
On 2025-05-03 09:28, Philipp Kern wrote:
> The problem is that we currently do not want to retain this data.
You're absolutely right here: there is no point in retaining the raw
data, as it gets stale pretty fast anyway. It has to be processed with
minimal delay and then fed into some kind of time-series database.
> It'd require a clear measure of usefulness, not just a "it would be
> nice if we had it". And there would need to be actual criteria of what
> we would be interested in. Raw download count? Some measure of
> bucketing by source IP or not? What about container/hermetic builders
> fetching the same ancient package over and over again from snapshot?
> Does the version matter?
It would help (as an additional data input) when having to make
decisions about keeping or removing packages, especially those with
very low popcon scores. I would also expect the download counts to be
particularly significant (for estimating the installed base) right
after a package update is released.
Having the count of total successful downloads and the count of unique
IPs for a given package+version pair (= URI) within a given time
interval would be a good start. Further refinements could be
implemented later, like segregating counts by geographical area and
consumer/corporate address range. With these schemes there are no
privacy issues, as IP addresses are not retained at all in the TSDB
(not even pseudonymized/anonymized). Time resolution could be hourly to
start with, and then maybe down to the minute for recent history,
depending on the required processing power and storage.
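
To illustrate, here is a minimal sketch of that hourly aggregation in
Python. The Apache-style combined log format, the one-pass-per-hour
job and the .deb filtering are assumptions on my part; what the caches
actually log is an open question. IPs are only held in memory to count
distinct clients and are never written out.

    import re
    import sys
    from collections import defaultdict

    # Assumed log line shape: IP - - [timestamp] "GET /pool/.../foo_1.2-3_amd64.deb HTTP/1.1" 200 size ...
    LOG_RE = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "GET (?P<uri>\S+) [^"]*" (?P<status>\d{3}) '
    )

    downloads = defaultdict(int)   # URI -> successful download count
    unique_ips = defaultdict(set)  # URI -> distinct client IPs (memory only)

    for line in sys.stdin:
        m = LOG_RE.match(line)
        # Only plain 200 responses for .deb files are counted in this sketch.
        if not m or m.group("status") != "200" or not m.group("uri").endswith(".deb"):
            continue
        uri = m.group("uri")       # encodes package, version and architecture
        downloads[uri] += 1
        unique_ips[uri].add(m.group("ip"))

    # Only the aggregates leave this process and go into the TSDB.
    for uri, count in sorted(downloads.items()):
        print(f"{uri}\t{count}\t{len(unique_ips[uri])}")

A plain set per URI gives exact unique-IP counts; at the volumes
discussed below it may have to be replaced by a probabilistic counter
such as HyperLogLog, but that is an implementation detail.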
> There will be lots of packages that are rarely downloaded and still
> important.
Indeed. That's just additional data to help with making decisions in
cases where we have doubts.
> Back of the envelope math says that'd be 600 GB/d of raw syslog log
> traffic.
I don't think that regular syslog is a reasonable way to retrieve that
amount of data from distant hosts. I don't know what the options are
with the current cache provider, but transferring already compressed
data every hour (or at a shorter interval, or streaming compressed
data) sounds better. That would amount to ~2 GiB compressed (~25 GiB
uncompressed) data every hour on average, which seems workable.
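
For reference, the arithmetic behind those figures; the ~12:1
compression ratio for syslog-style text is an assumption of mine, not
a measured value:

    # Back-of-envelope check of the hourly volumes quoted above.
    raw_per_day_gb = 600                                     # figure quoted by Philipp
    raw_per_hour_gib = raw_per_day_gb * 10**9 / 24 / 2**30   # ~23 GiB/h
    compressed_per_hour_gib = raw_per_hour_gib / 12          # ~2 GiB/h, assuming ~12:1
    print(f"~{raw_per_hour_gib:.0f} GiB/h raw, ~{compressed_per_hour_gib:.1f} GiB/h compressed")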
Is there any way I could get a copy of a log file (the current ones,
with 1% sampling) to experiment with?
Cheers,
--
Julien Plissonneau Duquène