Re: Package statistics by downloads

    From Julien Plissonneau Duquène@21:1/5 to All on Mon May 26 11:20:01 2025
    Hi,

    I would be interested in per-package-and-version download statistics and
    trends as well.

    On 2025-05-03 09:28, Philipp Kern wrote:

    > The problem is that we currently do not want to retain this data.

    You're absolutely right here: there is no point in retaining the raw
    data, as it goes stale pretty fast anyway. It has to be processed with
    minimal delay and then fed into some kind of time-series database.

    > It'd require a clear measure of usefulness, not just a "it would be
    > nice if we had it". And there would need to be actual criteria of
    > what we would be interested in. Raw download count? Some measure of
    > bucketing by source IP or not? What about container/hermetic
    > builders fetching the same ancient package over and over again from
    > snapshot? Does the version matter?

    It would help (as an additional data input) when having to make
    decisions about keeping or removing packages, especially those with
    very low popcon scores. I would also expect the download counts to be
    particularly significant (for the sake of estimating the installed
    base) right after a package update is released.

    Having the count of total successful downloads and the count of unique
    IPs for a given package+version pair (= URI) within a given time
    interval would be a good start. Further refinements could be
    implemented later, like segregating counts by geographical area and
    consumer/corporate address range. With this scheme there are no
    privacy issues, as IP addresses are not retained at all in the TSDB
    (not even pseudonymized/anonymized). Time resolution could be hourly
    to start with, and then perhaps down to the minute for recent history,
    depending on the required processing power and storage.
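
    For illustration, here is a rough sketch in Python of the kind of
    aggregation I have in mind. The input format (timestamp, client IP,
    URI, HTTP status) and the "200" check are assumptions on my part, not
    the actual mirror/CDN log format; raw IPs only ever live in memory for
    the current hour bucket and are discarded when that bucket is flushed.

        #!/usr/bin/env python3
        # Rough sketch: turn a download log stream into per-URI hourly
        # aggregates. Assumed input lines (not the real log format):
        #   "<ISO8601 timestamp> <client IP> <URI> <HTTP status>"
        import sys
        from collections import defaultdict

        def flush(hour, counts, uniques):
            # Only aggregate numbers leave the process; the IP sets are
            # dropped right after this, so no addresses are retained.
            for uri in sorted(counts):
                print(f"{hour}\t{uri}\t{counts[uri]}\t{len(uniques[uri])}")

        def main():
            current_hour = None
            counts = defaultdict(int)   # URI -> successful download count
            uniques = defaultdict(set)  # URI -> client IPs, memory only
            for line in sys.stdin:
                fields = line.split()
                if len(fields) < 4:
                    continue            # skip malformed lines
                ts, ip, uri, status = fields[:4]
                if status != "200":
                    continue            # count successful downloads only
                hour = ts[:13]          # "YYYY-MM-DDTHH" bucket
                if hour != current_hour:
                    if current_hour is not None:
                        flush(current_hour, counts, uniques)
                    current_hour = hour
                    counts, uniques = defaultdict(int), defaultdict(set)
                counts[uri] += 1
                uniques[uri].add(ip)
            if current_hour is not None:
                flush(current_hour, counts, uniques)

        if __name__ == "__main__":
            main()

    A HyperLogLog per URI could replace the sets if approximate unique-IP
    counts are good enough and memory becomes a concern.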

    > There will be lots of packages that are rarely downloaded and still
    > important.

    Indeed. That's just additional data to help make decisions in cases
    where we have doubts.

    > Back of the envelope math says that'd be 600 GB/d of raw syslog log
    > traffic.

    I don't think that regular syslog is a reasonable way to retrieve that
    amount of data from distant hosts. I don't know what the options are
    with the current cache provider, but transferring already-compressed
    data every hour (or at a shorter interval, or streaming compressed
    data) sounds better. That would amount to ~2 GiB of compressed
    (~25 GiB uncompressed) data every hour on average, which seems
    workable.
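
    For the record, the back-of-the-envelope arithmetic (the ~12:1
    compression ratio for log text is my assumption):

        # Back-of-envelope check; ~12:1 compression is an assumption.
        raw_per_day_gb = 600                          # figure quoted above
        raw_per_hour_gib = raw_per_day_gb * 10**9 / 24 / 2**30   # ~23 GiB/h
        compressed_per_hour_gib = raw_per_hour_gib / 12          # ~2 GiB/h
        print(f"raw ~{raw_per_hour_gib:.1f} GiB/h, "
              f"compressed ~{compressed_per_hour_gib:.1f} GiB/h")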

    Is there any way I could get a copy of a log file (the current ones
    with 1% sampling) to experiment with?

    Cheers,

    --
    Julien Plissonneau Duquène
