• Sorting problem with Unix sort(1) with UTF-8 punctuation characters - l

    From Janis Papanagnou@21:1/5 to All on Wed Feb 19 12:27:18 2025
    I've been sorting punctuation characters on one Unix system and it
    did not produce the expected result. Switching to another system did
    it as expected.

    The test program (it contains non-ASCII middle-dot characters) was

    sort -t $'\t' <<EOT
    >····**·······**················< abc1
    >···········**······**··········< efg2
    >·**·························**·< hij3
    >············**·················< klm4
    >···**····················**····< nop5
    >···**···················**·**··< qrs6
    >··**··········**·········**····< tuv7
    >**·····························< wxy8
    EOT


    Run on an older system - with sort (GNU coreutils) 8.13 - produced

    >**·····························< wxy8
    >·**·························**·< hij3
    >··**··········**·········**····< tuv7
    >···**···················**·**··< qrs6
    >···**····················**····< nop5
    >····**·······**················< abc1
    >···········**······**··········< efg2
    >············**·················< klm4


    On a newer system - with sort (GNU coreutils) 8.28 - it produced no
    sorting at all (of these lines[*]).

    >····**·······**················< abc1
    >···········**······**··········< efg2
    >·**·························**·< hij3
    >············**·················< klm4
    >···**····················**····< nop5
    >···**···················**·**··< qrs6
    >··**··········**·········**····< tuv7
    >**·····························< wxy8


    One hypothesis was that it's some locale issue. So I've copied the
    LC_* settings to the newer system and disabled them one by one.
    Strangely, the one that was responsible for the effect was LC_TIME!

    On the correct sorting system it was defined as
    LC_TIME=de_DE.UTF-8@isodate
    and the one that worked improperly had
    LC_TIME=de_DE.UTF-8

    Now I'm puzzled in many ways...
    If anything, I'd expected LC_COLLATE to have an effect on sorting.
    Then there's no locale with @isodate on that sort-defunct system.
    And clearing that LC_TIME locale or removing the "@isodate" part
    did not change anything; it needs that setting to a non-existing
    locale file to work correctly on the otherwise not correctly
    sorting system.

    Does anyone have an idea what's going on here?

    I'm reluctant to globally set LC_TIME=de_DE.UTF-8@isodate
    (since there is no file with that name in the locale directories).

    Thanks.

    Janis

    [*] Lines with additional other contents than the depicted payload
    were sorted correctly.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Christian Weisgerber@21:1/5 to Janis Papanagnou on Wed Feb 19 20:22:45 2025
    On 2025-02-19, Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

    > If anything, I'd expected LC_COLLATE to have an effect on sorting.
    > Then there's no locale with @isodate on that sort-defunct system.
    > And clearing that LC_TIME locale or removing the "@isodate" part
    > did not change anything; it needs that setting to a non-existing
    > locale file to work correctly on the otherwise not correctly
    > sorting system.

    My working hypothesis would be that setting LC_TIME to a nonexistent
    locale causes an error that invalidates the _whole_ locale setting
    and causes a fallback to a default setting, likely the "C" locale.
    You can check that sorting with LC_ALL=C or an invalid value like
    LC_ALL=foobar will produce your "correct" result.
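
    This check can be tried without touching the session's locale settings. A minimal sketch, assuming a POSIX shell and three made-up lines in the style of the test data (not the original eight): prefixing the command with LC_ALL=C applies the "C" locale to that one sort call only.

```shell
# Three sample lines in the style of the test data (the dot is U+00B7).
sample='>····**<
>**····<
>··**··<'

# LC_ALL takes precedence over every LC_* variable and over LANG,
# but the assignment only applies to this single sort invocation.
sorted_c=$(printf '%s\n' "$sample" | LC_ALL=C sort)
printf '%s\n' "$sorted_c"
```

    If plain "sort" on the same input prints a different order, the difference comes from the collation locale of the session.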

    A corollary from this would be that your "sort-defunct" system uses
    a different collation order than your "correctly" sorting system
    for the de_DE.UTF-8 locale.

    On the FreeBSD 14-STABLE system I'm typing this on, sorting your
    example data with my typical C.UTF-8 locale produces your expected
    result, sorting with de_DE.UTF-8 (or en_US.UTF-8) produces a different
    order.

    >····**·······**················< abc1
    >···········**······**··········< efg2
    >·**·························**·< hij3

    Also, I have no idea what could be considered the "correct" sorting
    order for this.

    --
    Christian "naddy" Weisgerber naddy@mips.inka.de

  • From Dan Cross@21:1/5 to janis_papanagnou+ng@hotmail.com on Wed Feb 19 22:35:43 2025
    In article <vp4f6o$288ui$1@dont-email.me>,
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    > I've been sorting punctuation characters on one Unix system and it
    > did not produce the expected result. Switching to another system did
    > it as expected.
    >
    > The test program (it contains non-ASCII middle-dot characters) was
    >
    > sort -t $'\t' <<EOT

    Do you really have the '$' there?

    - Dan C.

  • From Janis Papanagnou@21:1/5 to Christian Weisgerber on Thu Feb 20 01:54:15 2025
    On 19.02.2025 21:22, Christian Weisgerber wrote:
    > On 2025-02-19, Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    >
    >> If anything, I'd expected LC_COLLATE to have an effect on sorting.
    >> Then there's no locale with @isodate on that sort-defunct system.
    >> And clearing that LC_TIME locale or removing the "@isodate" part
    >> did not change anything; it needs that setting to a non-existing
    >> locale file to work correctly on the otherwise not correctly
    >> sorting system.
    >
    > My working hypothesis would be that setting LC_TIME to a nonexistent
    > locale causes an error that invalidates the _whole_ locale setting
    > and causes a fallback to a default setting, likely the "C" locale.
    > You can check that sorting with LC_ALL=C or an invalid value like
    > LC_ALL=foobar will produce your "correct" result.

    That was actually also my own first locale-based hypothesis, and
    setting LC_ALL=C was the first thing I tried (before identifying
    the strange LC_TIME "solution"). But that setting did not change
    that strange behavior. (But see below.)


    > A corollary from this would be that your "sort-defunct" system uses
    > a different collation order than your "correctly" sorting system
    > for the de_DE.UTF-8 locale.

    Right. The point is that the two systems I'm using are handled by
    me in different ways. The old system is one where I changed on a
    system level all deficiencies I encountered; the @isodate locale
    is such a beast. (It works on that system.) The newer system is
    one that got standard updates and fewer (or hardly any) "fixes" by
    me, so that I'd expect to work better "as designed". (But the
    opposite is the case.)

    On the old system I've explicitly defined
    LC_TIME=de_DE.UTF-8@isodate
    LC_COLLATE=C.UTF-8
    and on the new system the collation is
    LC_TIME=de_DE.UTF-8
    LC_COLLATE=en_US.UTF-8

    I'm sure there was a reason why the setting is now "en_US" instead
    of "de_DE" (like almost all other LC settings), so I'm reluctant
    to change that. (But setting LC_COLLATE to "C.UTF-8" works as well.)

    I think I'll have to use a local (not system wide) LC-change to fix
    the issue to behave as I'd expect without touching the rest.


    > On the FreeBSD 14-STABLE system I'm typing this on, sorting your
    > example data with my typical C.UTF-8 locale produces your expected
    > result, sorting with de_DE.UTF-8 (or en_US.UTF-8) produces a different
    > order.
    >
    > >····**·······**················< abc1
    > >···········**······**··········< efg2
    > >·**·························**·< hij3
    >
    > Also, I have no idea what could be considered the "correct" sorting
    > order for this.

    Unless all the punctuation characters used are ignored or treated as
    having the same sort order, it should IMO be obvious that the
    original unsorted order is not correct.

    Thanks for your reply. It helped to find another setting that produces
    the desired result.

    Janis

  • From Janis Papanagnou@21:1/5 to Dan Cross on Thu Feb 20 02:04:42 2025
    On 19.02.2025 23:35, Dan Cross wrote:
    > In article <vp4f6o$288ui$1@dont-email.me>,
    > Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    >> I've been sorting punctuation characters on one Unix system and it
    >> did not produce the expected result. Switching to another system did
    >> it as expected.
    >>
    >> The test program (it contains non-ASCII middle-dot characters) was
    >>
    >> sort -t $'\t' <<EOT
    >
    > Do you really have the '$' there?

    Yes, I have the habit of using the Korn shell's "ANSI C String" feature
    wherever control characters are to be placed in shell code.[*]

    Here it's actually an unnecessary remnant of tests I had made
    with 'sort -k1' (to be sure that 'sort' operates on the (first)
    punctuation field and not on the (second) text items).
    (I should have removed it before posting so as not to confuse
    readers.)
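
    Restricting the sort key to the first TAB-separated field can be sketched as below. This is only an illustration under assumptions: the two records are made up, the C locale is forced to get a reproducible order, and the TAB is produced with printf so the snippet also runs in shells without the $'...' syntax (in ksh or bash, $'\t' yields the same character).

```shell
# A literal TAB; command substitution strips only trailing newlines.
tab=$(printf '\t')

# Two made-up records: punctuation field, TAB, text item.
# -t sets the field separator, -k1,1 limits the key to field 1.
sorted=$(printf '%s\t%s\n' '>·**<' 'abc1' '>**·<' 'wxy8' |
    LC_ALL=C sort -t "$tab" -k1,1)
printf '%s\n' "$sorted"
```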

    Janis

    [*] They make the intent of the code visible and are safer than
    literal TABs. For posting it's also an advantage, since samples can
    be copied and pasted without wondering what the whitespace actually
    is, or fearing that the client software "intelligently" changes the
    format.

  • From Lawrence D'Oliveiro@21:1/5 to Janis Papanagnou on Thu Feb 20 01:36:59 2025
    On Thu, 20 Feb 2025 01:54:15 +0100, Janis Papanagnou wrote:

    > I'm sure there was a reason why the setting is now "en_US" instead of
    > "de_DE" (like almost all other LC settings), so I'm reluctant to
    > change that.

    You realize it’s easy to temporarily change environment variables on a
    per-command-invocation basis?
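
    A per-command override of this kind might look as follows. It is a sketch, not a recommendation for any particular setup: the function name csort is hypothetical, and LC_ALL=C is just one possible choice of locale. The assignment in front of sort is visible to that one process only, so the calling shell keeps its own settings.

```shell
# Hypothetical helper: run sort(1) in the C locale for this call only.
csort() {
    LC_ALL=C sort "$@"
}

# Record the shell's own LC_ALL before and after the call.
before=${LC_ALL-unset}
first=$(printf '%s\n' '·b' '*a' | csort | head -n 1)
after=${LC_ALL-unset}

printf 'first line: %s\n' "$first"
printf 'LC_ALL before: %s, after: %s\n' "$before" "$after"
```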

  • From Christian Weisgerber@21:1/5 to Dan Cross on Thu Feb 20 01:16:18 2025
    On 2025-02-19, Dan Cross <cross@spitfire.i.gajendra.net> wrote:

    >> The test program (it contains non-ASCII middle-dot characters) was
    >>
    >> sort -t $'\t' <<EOT
    >
    > Do you really have the '$' there?

    Yup.
    Dollar single quotes are a thing, says the latest edition of POSIX: https://pubs.opengroup.org/onlinepubs/9799919799/utilities/V3_chap02.html#tag_19_02_04

    --
    Christian "naddy" Weisgerber naddy@mips.inka.de

  • From Janis Papanagnou@21:1/5 to Christian Weisgerber on Thu Feb 20 06:37:01 2025
    On 19.02.2025 21:22, Christian Weisgerber wrote:

    > My working hypothesis would be that setting LC_TIME to a nonexistent
    > locale causes an error that invalidates the _whole_ locale setting
    > and causes a fallback to a default setting, likely the "C" locale.

    One thing I forgot to mention here: if an unknown locale is used,
    I get an error "...: unknown locale". But it seems to be sufficient
    for the first part (e.g. "de_DE") to exist to avoid a diagnostic.
    Suffixes as in "de_DE.xyz" are not reported as errors. So, yes, some
    fallback/default must be in place, since the commands are executed
    anyway, even with a reported wrong locale. I'd expect that "de_DE.xyz"
    would fall back to "de_DE", but that is speculation. (A peek into the
    'strace' differences didn't really enlighten me.[*])

    Janis

    [*] For LC_ALL=de_DE.xyz@abc for example:

    open("/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
    open("/usr/share/locale/locale.alias", O_RDONLY|O_CLOEXEC) = 3
    open("/usr/lib/locale/de_DE.xyz@abc/LC_IDENTIFICATION", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
    open("/usr/lib/locale/de_DE@abc/LC_IDENTIFICATION", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
    open("/usr/lib/locale/de.xyz@abc/LC_IDENTIFICATION", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
    open("/usr/lib/locale/de@abc/LC_IDENTIFICATION", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
    open("/usr/lib/locale/de_DE.xyz/LC_IDENTIFICATION", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
    open("/usr/lib/locale/de_DE/LC_IDENTIFICATION", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
    open("/usr/lib/locale/de.xyz/LC_IDENTIFICATION", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
    open("/usr/lib/locale/de/LC_IDENTIFICATION", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
    open("/usr/share/locale-langpack/de_DE.xyz@abc/LC_IDENTIFICATION", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
    open("/usr/share/locale-langpack/de_DE@abc/LC_IDENTIFICATION", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
    open("/usr/share/locale-langpack/de.xyz@abc/LC_IDENTIFICATION", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
    open("/usr/share/locale-langpack/de@abc/LC_IDENTIFICATION", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
    open("/usr/share/locale-langpack/de_DE.xyz/LC_IDENTIFICATION", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
    open("/usr/share/locale-langpack/de_DE/LC_IDENTIFICATION", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
    open("/usr/share/locale-langpack/de.xyz/LC_IDENTIFICATION", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
    open("/usr/share/locale-langpack/de/LC_IDENTIFICATION", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
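
    Whether a given locale name is known to the system can also be checked from the shell instead of reading the trace. A rough sketch, assuming locale(1) with its -a option is available; the helper names normalize and locale_installed are hypothetical, and the normalization (lowercasing, dropping hyphens) is an assumption made because some systems list "de_DE.utf8" where others list "de_DE.UTF-8".

```shell
# Hypothetical helper: lowercase and drop hyphens, so that "de_DE.UTF-8"
# and the "de_DE.utf8" spelling compare equal.
normalize() {
    tr 'A-Z' 'a-z' | tr -d '-'
}

# Hypothetical helper: succeed if `locale -a` lists the given name.
locale_installed() {
    wanted=$(printf '%s\n' "$1" | normalize)
    locale -a 2>/dev/null | normalize | grep -Fqx -- "$wanted"
}

for name in de_DE.UTF-8 de_DE.UTF-8@isodate de_DE.xyz@abc; do
    if locale_installed "$name"; then
        echo "$name: listed"
    else
        echo "$name: not listed"
    fi
done
```

    The output depends entirely on which locales are installed, so none is shown here.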

  • From Lem Novantotto@21:1/5 to All on Thu Feb 20 11:14:42 2025
    On Wed, 19 Feb 2025 12:27:18 +0100, Janis Papanagnou wrote:

    > I've been sorting punctuation characters on one Unix system and it did
    > not produce the expected result. Switching to another system did it as
    > expected.

    The second system (not working "properly") is treating all dots as equal,
    so it sorts just the letters.

    My system doesn't sort properly either. On my system:

    $ locale
    LANG=it_IT.UTF-8
    LANGUAGE=it_IT
    LC_CTYPE="it_IT.UTF-8"
    LC_NUMERIC="it_IT.UTF-8"
    LC_TIME="it_IT.UTF-8"
    LC_COLLATE="it_IT.UTF-8"
    LC_MONETARY="it_IT.UTF-8"
    LC_MESSAGES="it_IT.UTF-8"
    LC_PAPER="it_IT.UTF-8"
    LC_NAME="it_IT.UTF-8"
    LC_ADDRESS="it_IT.UTF-8"
    LC_TELEPHONE="it_IT.UTF-8"
    LC_MEASUREMENT="it_IT.UTF-8"
    LC_IDENTIFICATION="it_IT.UTF-8"
    LC_ALL=

    Let's see. In my /usr/share/i18n/locales/it_IT, I have this section:

    LC_COLLATE
    copy "iso14651_t1"
    END LC_COLLATE

    On your second system, you have LC_COLLATE=en_US or de_DE. It's the same:
    in the respective files there is always the same section:
    LC_COLLATE
    copy "iso14651_t1"
    END LC_COLLATE

    But in /usr/share/i18n/locales/C there is:

    LC_COLLATE
    % The keyword 'codepoint_collation' in any part of any LC_COLLATE
    % immediately discards all collation information and causes the
    % locale to use strcmp/wcscmp for collation comparison. This is
    % exactly what is needed for C (ASCII) or C.UTF-8.
    codepoint_collation
    END LC_COLLATE

    And here it is:

    $ LC_COLLATE=C sort yada yada

    gives the correct sorting.
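
    The byte values show why a plain strcmp-style comparison puts "*" before the middle dot. A small sketch, assuming a POSIX shell with od(1) and UTF-8 encoded text; the two sample lines are made up. LC_ALL is cleared for the command because, when it is set to a non-empty value, it takes precedence over LC_COLLATE.

```shell
# '*' is the single byte 0x2A; U+00B7 is the two bytes 0xC2 0xB7 in UTF-8.
star_bytes=$(printf '*' | od -An -tx1 | tr -d ' \n')
dot_bytes=$(printf '·' | od -An -tx1 | tr -d ' \n')
echo "star=$star_bytes dot=$dot_bytes"

# With LC_ALL emptied, LC_COLLATE=C decides the comparison for this call.
first=$(printf '%s\n' '>·**<' '>**·<' | LC_ALL= LC_COLLATE=C sort | head -n 1)
printf '%s\n' "$first"
```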
    --
    Bye, Lem
    Talis erit dies qualem egeris
