• Module urljoin does not appear to work with scheme Gemini

    From Schimon Jehudah@21:1/5 to All on Mon Apr 21 08:38:45 2025
    Good day.

    Yesterday, I have added support for a new syndication format, Gemini
    feed.

    Yet, it appears that module urljoin fails at its task, even though
    module urlsplit correctly handles Gemini.

    Python 3.13.3

    >>> from urllib.parse import urljoin
    >>> urljoin('gopher://gopher.floodgap.com:70/1/overbite', '../one-level-up')
    'gopher://gopher.floodgap.com:70/one-level-up'
    >>> urljoin('gopher://gopher.floodgap.com:70/1/overbite', 'same-level')
    'gopher://gopher.floodgap.com:70/1/same-level'
    >>> urljoin('gemini://woodpeckersnest.space/~schapps/journal/2025-04-20-slixfeed-gemini-and-twtxt.gmi', '../one-level-up')
    '../one-level-up'
    >>> urljoin('gemini://woodpeckersnest.space/~schapps/journal/2025-04-20-slixfeed-gemini-and-twtxt.gmi', 'same-level')
    'same-level'
    >>> from urllib.parse import urlsplit
    >>> urlsplit('gopher://gopher.floodgap.com:70/1/overbite')
    SplitResult(scheme='gopher', netloc='gopher.floodgap.com:70', path='/1/overbite', query='', fragment='')
    >>> urlsplit('gemini://woodpeckersnest.space/~schapps/journal/2025-04-20-slixfeed-gemini-and-twtxt.gmi')
    SplitResult(scheme='gemini', netloc='woodpeckersnest.space', path='/~schapps/journal/2025-04-20-slixfeed-gemini-and-twtxt.gmi', query='', fragment='')

    https://git.xmpp-it.net/sch/Slixfeed/src/branch/master/slixfeed/parser/gmi.py

    Is this a problem with the module urljoin?

    To whom should reports about such concern be conveyed?

    Please advise.

    Kind regards,
    Schimon

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Henry S. Thompson@21:1/5 to Schimon Jehudah via Python-list on Tue Apr 22 15:33:52 2025
    Schimon Jehudah via Python-list writes:

    > Yesterday, I have added support for a new syndication format, Gemini
    > feed.

    I note that 'gemini' is not (yet?) a registered URI scheme:

    https://www.iana.org/assignments/uri-schemes/uri-schemes.xhtml

    ht
    --
    Henry S. Thompson, School of Informatics, University of Edinburgh
    10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND
    e-mail: ht@inf.ed.ac.uk
    URL: https://www.ltg.ed.ac.uk/~ht/
    [mail from me _always_ has a .sig like this -- mail without it is forged spam]

  • From Schimon Jehudah@21:1/5 to Henry S. Thompson on Tue Apr 22 18:22:53 2025
    Is there an "ignore" option for "urljoin" that allows schemes which are
    not in the list of schemes built into the Python interpreter?

    I think such an option is needed, even for schemes that are not
    registered, as there are ongoing attempts to censor Gemini and Gopher.

    gemini://woodpeckersnest.space/~schapps/journal/2024-05-28-censoring-gemini-and-gopher.gmi

    Schimon


  • From Henry S. Thompson@21:1/5 to Schimon Jehudah on Wed Apr 23 15:13:51 2025
    Schimon Jehudah writes:

    > Is there an "ignore" option for "urljoin" that allows schemes which are
    > not in the list of schemes built into the Python interpreter?

    Some approach to support future-proofing in general would seem to be
    in order. Given some other precedents, adding a boolean argument
    called either 'strict' or 'lax' would be my preference.

    It would seem that for backwards-compatibility, even though it feels
    backwards from the in-principle correct approach, the default should be
    either 'strict=True' or 'lax=False'.
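
[Editorial sketch] To make the 'strict' proposal concrete, here is a minimal wrapper, assuming a hypothetical function name urljoin_lax and a hypothetical keyword strict; neither exists in urllib.parse. With strict=False it borrows the 'http' scheme for the duration of the join and then restores the original scheme. This is only one possible way to emulate the behaviour and not how a standard-library change would necessarily work. The host example.org is a placeholder.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit, uses_relative


def urljoin_lax(base, url, strict=True):
    """Hypothetical wrapper: with strict=False, join even for unknown schemes."""
    base_parts = urlsplit(base)
    if strict or base_parts.scheme in uses_relative:
        # Default behaviour: defer entirely to the standard library.
        return urljoin(base, url)

    # Join as if the base were an http URL, which urljoin knows how to handle.
    proxy_base = urlunsplit(base_parts._replace(scheme="http"))
    joined = urljoin(proxy_base, url)
    joined_parts = urlsplit(joined)

    # Restore the original scheme only if the reference had no scheme of its own.
    if joined_parts.scheme == "http" and not urlsplit(url).scheme:
        return urlunsplit(joined_parts._replace(scheme=base_parts.scheme))
    return joined


base = "gemini://example.org/a/b/c.gmi"
print(urljoin_lax(base, "../one-level-up", strict=False))
print(urljoin_lax(base, "same-level", strict=False))
```

Keeping strict=True as the default means existing callers see no change, which matches the backwards-compatibility concern above.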

    I note that there are 440 schemes registered [1] as of today, with the following statuses:

    275 Provisional
    99 Permanent
    18 Historical
    48 [not given]

    The (Python 3.11) implementation of "urljoin" depends on a list of 18
    'uses_relative' scheme names: it would be silly to expect anyone to
    actually check even just the other 81 Permanent schemes to see if they
    should be added to this list, much less the Provisional or Historical
    ones, and even sillier to expect that the list ought to be regularly
    synchronised with the IANA registry.

    ht

    [1] https://www.iana.org/assignments/uri-schemes/uri-schemes.xhtml
    --
    Henry S. Thompson, School of Informatics, University of Edinburgh
    10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND
    e-mail: ht@inf.ed.ac.uk
    URL: https://www.ltg.ed.ac.uk/~ht/
    [mail from me _always_ has a .sig like this -- mail without it is forged spam]

  • From Anders Munch@21:1/5 to Henry S. Thompson on Thu Apr 24 08:36:12 2025
    Henry S. Thompson wrote:

    > Some approach to support future-proofing in general would seem to be in order.
    > Given some other precedents, adding a boolean argument called either 'strict' or 'lax' would be my preference.

    An alternative would be to refactor urllib.parse to use strategy objects
    for schemes.

    parse.py contains a number of lists of scheme names that act as flags to
    control parsing behaviour: uses_relative, uses_netloc, uses_params,
    non_hierarchical, uses_query and uses_fragment. (If written today they
    would be sets, but this is very old code that predates sets!) Group that
    information by scheme instead of by flag name, e.g. in a dataclass, and
    you have made yourself a strategy object lookup table:

    scheme_options = {
        'https': SchemeOptions(uses_relative=True, uses_netloc=True, uses_params=True),
        'git': SchemeOptions(uses_relative=False, uses_netloc=True, uses_params=False),
        ...
    }
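
[Editorial sketch] The table above could be derived from the existing lists rather than typed by hand. The names SchemeOptions, build_scheme_options and scheme_options are hypothetical and do not exist in urllib.parse; only the three lists that are read below come from the module.

```python
from dataclasses import dataclass
from urllib import parse


@dataclass(frozen=True)
class SchemeOptions:
    """Hypothetical per-scheme flags; not part of urllib.parse."""
    uses_relative: bool = False
    uses_netloc: bool = False
    uses_params: bool = False


def build_scheme_options():
    """Regroup the module's flag lists into one table keyed by scheme name."""
    names = set(parse.uses_relative) | set(parse.uses_netloc) | set(parse.uses_params)
    names.discard("")  # the lists use the empty string for "no scheme"
    return {
        name: SchemeOptions(
            uses_relative=name in parse.uses_relative,
            uses_netloc=name in parse.uses_netloc,
            uses_params=name in parse.uses_params,
        )
        for name in sorted(names)
    }


scheme_options = build_scheme_options()
print(scheme_options["https"])
print(scheme_options.get("gemini"))  # None unless the scheme appears in a list
```

A frozen dataclass is used so that a shared entry such as scheme_options['https'] cannot be modified by accident when it is passed around as a strategy object.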

    Once you have that, you can add the strategy object as an optional argument to functions. If the argument is not given, you find a strategy object from scheme_options to use. If the argument is given, you use that.

    The best part of this approach is that you now have a way of saying "treat this scheme exactly like https":

    from urllib import parse
    parse.urljoin('sptth://...', '../one-level-up', options=parse.scheme_options['https'])

    Note: I wrote this before I realised that the lists non_hierarchical,
    uses_query and uses_fragment are not used. With only three options
    instead of six, making a strategy object is not quite as attractive, but
    it is still worth considering.

    regards, Anders
