• Rationale for aligning data on even bytes in a Unix shell file?

    From Janis Papanagnou@21:1/5 to All on Sat Apr 26 13:47:15 2025
    In a "C" file (of the Kornshell software) I stumbled across this
    comment: "Each command in the history file starts on an even byte
    and is null-terminated."

    I wonder what's the reason behind that even-byte-alignment, on "C"
    level or on Unix/files level. Any ideas?

    Janis

    Note: Since it's not a shell question but more of a programming or
    platform related question I try to get the answer here (and not in
    comp.unix.shell); just saying to prevent distracting calls to order.
    Thanks.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Janis Papanagnou on Sat Apr 26 15:00:28 2025
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
    In a "C" file (of the Kornshell software) I stumbled across this
    comment: "Each command in the history file starts on an even byte
    and is null-terminated."

    I wonder what's the reason behind that even-byte-alignment, on "C"
    level or on Unix/files level. Any ideas?

    Possibly to support 16-bit character sets?

  • From Janis Papanagnou@21:1/5 to Keith Thompson on Sun Apr 27 01:12:14 2025
    On 26.04.2025 23:34, Keith Thompson wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
    In a "C" file (of the Kornshell software) I stumbled across this
    comment: "Each command in the history file starts on an even byte
    and is null-terminated."

    I wonder what's the reason behind that even-byte-alignment, on "C"
    level or on Unix/files level. Any ideas?

    Possibly to support 16-bit character sets?

    I don't think it supports 16-bit character sets.

    Currently it doesn't. - I'm not sure it's intended for a possible
    16 bit extension; I wouldn't think so but don't know.

    There was some note in the source about supporting 8 bit characters
    in "version 1" (instead of 7 bit ASCII, in "version 0"), IIRC. (So
    at least it could be possible in principle to support 16 bit with a
    "version 2", or so.)


    Unlike bash history files, which are plain text, ksh history files
    are in a binary format.

    I don't know whether the format includes any multi-byte integers.

    No, as far as I could see, besides the \0-terminated strings and the
    occasional \0 padding byte there are just line markers
    0x82 0x00 0xNN 0xNN 0xNN 0x00 and some "undo" marker with a version
    number, 0x81 0x00 (e.g.). So these markers also fit in multiples of
    16 bits. (Not sure how these sequences would conflict with 16 bit
    characters that have the same encoding.)

    If it does, reading such values directly into memory might be easier
    on some platforms if they're aligned.

    The relevant source file is src/cmd/ksh93/edit/history.c, in
    <https://github.com/ksh93/ksh>. It has functions to manipulate the
    history file, but I don't see a full description of the file format.

    Somewhere in that file I found it... <lookup> ...yes, a comment at
    the top of the file. You can find some more details when searching
    for the CPP tokens "HIST_CMDNO" and "HIST_UNDO".

    Janis

  • From Kaz Kylheku@21:1/5 to Janis Papanagnou on Sat Apr 26 23:35:34 2025
    On 2025-04-26, Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    In a "C" file (of the Kornshell software) I stumbled across this
    comment: "Each command in the history file starts on an even byte
    and is null-terminated."

    The full(er) text is:

    Each command in the history file starts on an even byte and is
    null-terminated. The first byte must contain the special character
    HIST_UNDO and the second byte is the version number.

    If the file is snarfed into a buffer, then if each command starts
    on an even byte, those two-byte pairs will be aligned and can be
    accessed as two-byte integers.

    At a glance, I don't see where the code relies on that, but maybe
    historically it did.

    The alignment could be of help if you're looking at the file
    with "od -tx2a".
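
    The alignment argument above can be sketched in C. This is only an
    illustration, not code from ksh: the helper names offset_is_even,
    load_u16 and load_u16_be are hypothetical. It shows that a two-byte
    pair at an even offset can be read as one 16-bit value, and that
    memcpy or explicit byte assembly works at any offset anyway.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: nonzero if a byte offset into the buffer is even. */
static int offset_is_even(size_t off)
{
    return (off & 1u) == 0;
}

/* Read two bytes at 'off' as a host-order 16-bit value.
 * memcpy is safe for any alignment; the result depends on host endianness. */
static uint16_t load_u16(const unsigned char *buf, size_t off)
{
    uint16_t v;
    memcpy(&v, buf + off, sizeof v);
    return v;
}

/* Read two bytes at 'off' as a big-endian 16-bit value, independent of
 * host byte order and alignment. */
static uint16_t load_u16_be(const unsigned char *buf, size_t off)
{
    return (uint16_t)((buf[off] << 8) | buf[off + 1]);
}
```

    With a history buffer starting with the bytes 0x81 0x01, the call
    load_u16_be(buf, 0) gives 0x8101, which is how "od -tx2" style dumps
    group the file on a big-endian view.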

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

  • From Janis Papanagnou@21:1/5 to Kaz Kylheku on Sun Apr 27 02:51:00 2025
    On 27.04.2025 01:35, Kaz Kylheku wrote:
    On 2025-04-26, Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    In a "C" file (of the Kornshell software) I stumbled across this
    comment: "Each command in the history file starts on an even byte
    and is null-terminated."
    [...]

    The alignment could be of help if you're looking at the file
    with "od -tx2a".

    Well, the intention was more to get some better readable text from
    ksh's history file. That annoyed me for long, and when I recently
    read in CUS Kenny's question about GNU grep and \0 it reminded me
    to have a closer look into ksh's format. As Keith mentioned, we do
    not have *that* issue with Bash's format, but for Ksh I'd like to
    have some tools (e.g. a grep, editing macros, etc.) to handle such
    files in a simpler (text-)way (without touching its native format).

    As it seems Ksh's history format (while "binary") is quite primitive
    so it won't be a big deal, I suppose; skip the markers and split on
    the nulls.
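
    A minimal sketch of "skip the markers and split on the nulls" in C
    follows. It is not taken from ksh. The marker values 0x81 and 0x82 and
    their lengths (2 and 6 bytes) are the ones reported in this thread; the
    function name hist_next and its interface are hypothetical, and the
    sketch assumes that a record beginning with one of those two bytes is
    always a marker.

```c
#include <stddef.h>

/* Marker bytes as reported in the thread (assumed, not verified here). */
enum { HIST_UNDO = 0x81, HIST_CMDNO = 0x82 };

/* Find the next text command at or after *pos.
 * Returns 1 and sets *cmd / *cmdlen if a command was found, 0 at the end.
 * *pos is advanced so the function can be called repeatedly. */
static int hist_next(const unsigned char *buf, size_t len, size_t *pos,
                     const unsigned char **cmd, size_t *cmdlen)
{
    size_t p = *pos;

    while (p < len) {
        unsigned char c = buf[p];

        if (c == 0) {
            p += 1;                 /* terminator or padding byte */
        } else if (c == HIST_UNDO) {
            p += 2;                 /* marker byte plus version/undo byte */
        } else if (c == HIST_CMDNO) {
            p += 6;                 /* six-byte command-number marker */
        } else {
            size_t start = p;
            while (p < len && buf[p] != 0)
                p++;                /* command text runs up to the next NUL */
            *cmd = buf + start;
            *cmdlen = p - start;
            *pos = p;
            return 1;
        }
    }
    *pos = len;
    return 0;
}
```

    Calling hist_next in a loop and writing each command with fwrite would
    give a plain-text view of the file. Note that this sketch simply skips
    an undo marker; it does not drop the command that the marker nullifies.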

    Janis

  • From Scott Lurndal@21:1/5 to Keith Thompson on Sun Apr 27 14:11:40 2025
    Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
    In a "C" file (of the Kornshell software) I stumbled across this
    comment: "Each command in the history file starts on an even byte
    and is null-terminated."

    I wonder what's the reason behind that even-byte-alignment, on "C"
    level or on Unix/files level. Any ideas?

    Possibly to support 16-bit character sets?

    I don't think it supports 16-bit character sets.

    Unlike bash history files, which are plain text, ksh history files
    are in a binary format.

    They're text with a leading 16-bit identifier. Which itself
    is probably the reason for the even-byte alignment.

    0000000: 8101 7669 6d20 7e2f 4d61 696c 2f65 7472 ..vim ~/Mail/etr
    0000010: 6173 6465 0a00 726d 207e 2f4d 6169 6c2f asde..rm ~/Mail/
    0000020: 6574 7261 7364 650a 0000 4c41 4e47 3d65 etrasde...LANG=e
    0000030: 6e5f 5553 2e75 7466 3820 7072 696e 7466 n_US.utf8 printf
    0000040: 2022 2527 3130 2e32 665c 6e22 2024 2828 "%'10.2f\n" $((
    0000050: 2031 3534 302e 3020 2a20 3231 312e 3839 1540.0 * 211.89
    0000060: 2029 290a 0000 4c41 4e47 3d65 6e5f 5553 ))...LANG=en_US


    I don't know whether the format includes any multi-byte integers.
    If it does, reading such values directly into memory might be easier
    on some platforms if they're aligned.

    The relevant source file is src/cmd/ksh93/edit/history.c, in
    <https://github.com/ksh93/ksh>. It has functions to manipulate the
    history file, but I don't see a full description of the file format.

    /*
    * Each command in the history file starts on an even byte is null terminated.
    * The first byte must contain the special character H_UNDO and the second
    * byte is the version number. The sequence H_UNDO 0, following a command,
    * nullifies the previous command. A six byte sequence starting with
    * H_CMDNO is used to store the command number so that it is not necessary
    * to read the file from beginning to end to get to the last block of
    * commands. This format of this sequence is different in version 1
    * then in version 0. Version 1 allows commands to use the full 8 bit
    * character set. It can understand version 0 format files.
    */
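
    The header and command-number marker described in that comment could
    be checked roughly as below. This is a sketch, not the ksh code. The
    byte values 0x81 and 0x82 and the layout 0x82 0x00 NN NN NN 0x00 come
    from this thread; the function names are hypothetical, and reading the
    three NN bytes as a big-endian number is an ASSUMPTION that should be
    checked against history.c. Version 0 files use a different marker
    layout according to the comment and are not handled.

```c
#include <stddef.h>

enum { H_UNDO = 0x81, H_CMDNO = 0x82 };

/* Return the version byte if the buffer starts with H_UNDO, else -1. */
static int hist_version(const unsigned char *buf, size_t len)
{
    if (len < 2 || buf[0] != H_UNDO)
        return -1;
    return buf[1];
}

/* Decode a six-byte command-number marker (version 1 layout as seen in
 * the thread). ASSUMPTION: the three middle bytes are big-endian.
 * Returns the command number, or -1 if 'm' is not such a marker. */
static long hist_cmdno(const unsigned char *m, size_t len)
{
    if (len < 6 || m[0] != H_CMDNO || m[1] != 0 || m[5] != 0)
        return -1;
    return ((long)m[2] << 16) | ((long)m[3] << 8) | (long)m[4];
}
```

    For the dump shown above, which starts with 81 01, hist_version would
    report version 1.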

  • From Janis Papanagnou@21:1/5 to Bonita Montero on Sun Apr 27 21:55:24 2025
    On 27.04.2025 20:32, Bonita Montero wrote:
    Am 26.04.2025 um 17:00 schrieb Scott Lurndal:

    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:

    In a "C" file (of the Kornshell software) I stumbled across this
    comment: "Each command in the history file starts on an even byte
    and is null-terminated."
    I wonder what's the reason behind that even-byte-alignment, on "C"
    level or on Unix/files level. Any ideas?

    Possibly to support 16-bit character sets?

    Unix has a big problem that it doesn't support 16 bit character sets.
    Win32 supported UCS-2 from the beginning and UTF-16 afaik since Windows
    2000.

    What would be the advantage of a 16 bit encoding? (As opposed to,
    say, UTF-8.)

    With Unix there's not even a standard charset for the filesystem;
    each filename character is just an octet.

    I think we have to distinguish the technical base size, an octet,
    from the actual filenames. My Linux has no problem representing,
    say, filenames with Chinese characters or German umlauts, which
    need more than one octet each.

    That this is possible in the first place (as I had been told some
    years ago) is an effect of the character-set non-specific encoding.

    Janis

  • From Kaz Kylheku@21:1/5 to Bonita Montero on Sun Apr 27 21:40:18 2025
    On 2025-04-27, Bonita Montero <Bonita.Montero@gmail.com> wrote:
    Am 26.04.2025 um 17:00 schrieb Scott Lurndal:

    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:

    In a "C" file (of the Kornshell software) I stumbled across this
    comment: "Each command in the history file starts on an even byte
    and is null-terminated."
    I wonder what's the reason behind that even-byte-alignment, on "C"
    level or on Unix/files level. Any ideas?

    Possibly to support 16-bit character sets?

    Unix has a big problem that it doesn't support 16 bit character sets.

    16 bit encodings are junk; not supporting them doesn't cause an
    issue whatsoever.

    Win32 supported UCS-2 from the beginning and UTF-16 afaik since Windows
    2000. With Unix there's not even a standard charset for the filesystem;
    each filename character is just an octet.

    What is the standard charset of a Windows file name?
    Be sure your answer covers FAT32 and such.

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

  • From Scott Lurndal@21:1/5 to Bonita Montero on Sun Apr 27 22:53:45 2025
    Bonita Montero <Bonita.Montero@gmail.com> writes:
    Am 26.04.2025 um 17:00 schrieb Scott Lurndal:

    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:

    In a "C" file (of the Kornshell software) I stumbled across this
    comment: "Each command in the history file starts on an even byte
    and is null-terminated."
    I wonder what's the reason behind that even-byte-alignment, on "C"
    level or on Unix/files level. Any ideas?

    Possibly to support 16-bit character sets?

    Unix has a big problem that it doesn't support 16 bit character sets.

    X/Open would argue that your statement is 100% false, as unix multibyte
    character sets (wchar_t, for example) have been around for three decades.

  • From Kenny McCormack@21:1/5 to Scott Lurndal on Sun Apr 27 23:45:06 2025
    In article <ZzyPP.2742757$OrR5.2488189@fx18.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    Bonita Montero <Bonita.Montero@gmail.com> writes:
    Am 26.04.2025 um 17:00 schrieb Scott Lurndal:

    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:

    In a "C" file (of the Kornshell software) I stumbled across this
    comment: "Each command in the history file starts on an even byte
    and is null-terminated."
    I wonder what's the reason behind that even-byte-alignment, on "C"
    level or on Unix/files level. Any ideas?

    Possibly to support 16-bit character sets?

    Unix has a big problem that it doesn't support 16 bit character sets.

    X/Open would argue that your statement is 100% false, as unix multibyte
    character sets (wchar_t, for example) have been around for three decades.

    Bonita being 100% wrong is not exactly front page news.

    --
    The book "1984" used to be a cautionary tale;
    Now it is a "how-to" manual.

  • From vallor@21:1/5 to Bonita.Montero@gmail.com on Mon Apr 28 01:21:39 2025
    On Mon, 28 Apr 2025 02:53:45 +0200, Bonita Montero
    <Bonita.Montero@gmail.com> wrote in <vumjhf$20u1e$1@raubtier-asyl.eternal-september.org>:

    Am 27.04.2025 um 21:55 schrieb Janis Papanagnou:

    I think we have to distinguish the technical base size, an octet,
    from the actual filenames. My Linux has no problem to represent,
    say, filenames in Chinese or German umlaut characters that require
    for representation 2 octets.

    You're joking. Which applications can currently handle more than
    7-bit characters with Unix files?

    _[/home/vallor/tmp]_(vallor@lm)🐧_
    $ touch 調和
    _[/home/vallor/tmp]_(vallor@lm)🐧_
    $ ls
    調和
    _[/home/vallor/tmp]_(vallor@lm)🐧_
    $ ls -l
    total 0
    -rw-rw-r-- 1 vallor vallor 0 Apr 27 17:59 調和

    ObC (What did I mess up here?):

    $ cat readit.c
    #include <stdio.h>
    #include <sys/types.h>
    #include <dirent.h>
    #include <string.h>

    int main(void)
    {
    DIR * this = {0};
    struct dirent * entry = {0};
    char * s;

    this = opendir(".");
    while ((entry = readdir(this))!=NULL)
    {
    if(!strcmp(entry->d_name,".")) continue;
    if(!strcmp(entry->d_name,"..")) continue;
    for(s = entry->d_name; *s ; s++)
    {
    printf("%x\n",*s);
    }
    puts("---");
    }

    return 0;
    }

    --
    -v

  • From Lawrence D'Oliveiro@21:1/5 to Bonita Montero on Mon Apr 28 01:22:44 2025
    On Sun, 27 Apr 2025 20:32:58 +0200, Bonita Montero wrote:

    Unix has a big problem that it doesn't support 16 bit character sets.
    Win32 supported UCS-2 from the beginning and UTF-16 afaik since Windows
    2000.

    Unfortunately, Windows has had to deal with the UCS-2→UTF-16 encoding
    kludge ever since then. Linux managed to avoid all that.

  • From Lawrence D'Oliveiro@21:1/5 to Bonita Montero on Mon Apr 28 04:47:55 2025
    On Mon, 28 Apr 2025 06:28:05 +0200, Bonita Montero wrote:

    Am 28.04.2025 um 03:22 schrieb Lawrence D'Oliveiro:

    Unfortunately, Windows has had to deal with the UCS-2→UTF-16 encoding
    kludge ever since then. ...

    That's not true. The codepoints for the surrogates were unused before.

    The problem is the fact that you have to deal with surrogates.
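
    What "dealing with surrogates" means can be sketched in C. The
    function name utf16_encode is hypothetical; the arithmetic is the
    standard UTF-16 encoding. Code points up to U+FFFF fit in one 16-bit
    unit, while anything above needs a pair of units, so code that assumes
    "one unit is one character" (the UCS-2 view) breaks.

```c
#include <stdint.h>

/* Encode one Unicode code point as UTF-16.
 * Returns the number of 16-bit units written (1 or 2), or 0 if 'cp' is
 * not a valid scalar value (a surrogate itself, or above U+10FFFF). */
static int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                       /* reserved for surrogates */
    if (cp > 0x10FFFF)
        return 0;                       /* outside Unicode */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* one unit, same as UCS-2 */
        return 1;
    }
    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

    For example, U+8ABF (調) takes one unit, whereas U+1F427 (the penguin
    used in the shell prompt elsewhere in this thread) takes the two units
    0xD83D 0xDC27.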

  • From vallor@21:1/5 to Bonita.Montero@gmail.com on Mon Apr 28 04:55:13 2025
    On Mon, 28 Apr 2025 06:28:44 +0200, Bonita Montero
    <Bonita.Montero@gmail.com> wrote in <vun04h$2fjrn$2@raubtier-asyl.eternal-september.org>:

    Am 28.04.2025 um 03:21 schrieb vallor:
    On Mon, 28 Apr 2025 02:53:45 +0200, Bonita Montero
    <Bonita.Montero@gmail.com> wrote in
    <vumjhf$20u1e$1@raubtier-asyl.eternal-september.org>:

    Am 27.04.2025 um 21:55 schrieb Janis Papanagnou:

    I think we have to distinguish the technical base size, an octet,
    from the actual filenames. My Linux has no problem to represent, say,
    filenames in Chinese or German umlaut characters that require for
    representation 2 octets.

    You're joking. Which applications can currently handle more than
    7-bit characters with Unix files?

    _[/home/vallor/tmp]_(vallor@lm)🐧_
    $ touch 調和
    _[/home/vallor/tmp]_(vallor@lm)🐧_
    $ ls
    調和
    _[/home/vallor/tmp]_(vallor@lm)🐧_
    $ ls -l
    total 0
    -rw-rw-r-- 1 vallor vallor 0 Apr 27 17:59 調和

    ObC (What did I mess up here?):

    $ cat readit.c
    #include <stdio.h>
    #include <sys/types.h>
    #include <dirent.h>
    #include <string.h>

    int main(void)
    {
    DIR * this = {0};
    struct dirent * entry = {0};
    char * s;

    this = opendir(".");
    while ((entry = readdir(this))!=NULL)
    {
    if(!strcmp(entry->d_name,".")) continue;
    if(!strcmp(entry->d_name,"..")) continue;
    for(s = entry->d_name; *s ; s++)
    {
    printf("%x\n",*s);
    }
    puts("---");
    }

    return 0;
    }


    https://stackoverflow.com/questions/38948141/how-are-linux-shells-and-filesystem-unicode-aware

    I don't see your point. Could I ask you to elaborate?

    --
    -v

  • From Janis Papanagnou@21:1/5 to Bonita Montero on Mon Apr 28 09:29:46 2025
    On 28.04.2025 02:53, Bonita Montero wrote:
    Am 27.04.2025 um 21:55 schrieb Janis Papanagnou:

    I think we have to distinguish the technical base size, an octet,
    from the actual filenames. My Linux has no problem to represent,
    say, filenames in Chinese or German umlaut characters that require
    for representation 2 octets.

    You're joking. Which applications can currently handle more than
    7-bit characters with Unix files?

    All Applications on my Unix system, for example

    $ touch Ölüberschuß.txt
    $ ls Ö*
    Ölüberschuß.txt
    $ ls Ö* | od -t x1 -c
    0000000 c3 96 6c c3 bc 62 65 72 73 63 68 75 c3 9f 2e 74
    303 226 l 303 274 b e r s c h u 303 237 . t
    0000020 78 74 0a
    x t \n
    0000023
    $ rm Ölüberschuß.txt
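
    The byte pairs in the od output above are plain UTF-8. A small C
    sketch of the encoding, with the hypothetical name utf8_encode,
    reproduces them; it is an illustration only and has nothing to do
    with how ls or the kernel treat the name, which is just stored bytes.

```c
#include <stdint.h>

/* Encode one Unicode code point as UTF-8.
 * Returns the number of bytes written (1 to 4), or 0 if 'cp' is not a
 * valid scalar value. */
static int utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                    /* 7-bit ASCII, one byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                   /* two bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                 /* three bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                   /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {               /* four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

    U+00D6 (Ö) encodes to c3 96, U+00FC (ü) to c3 bc and U+00DF (ß) to
    c3 9f, matching the dump, while the ASCII letters stay single bytes.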


    Janis

  • From Janis Papanagnou@21:1/5 to Bonita Montero on Mon Apr 28 09:42:16 2025
    On 28.04.2025 09:27, Bonita Montero wrote:
    Am 28.04.2025 um 06:55 schrieb vallor:

    On Mon, 28 Apr 2025 06:28:44 +0200, Bonita Montero

    <Bonita.Montero@gmail.com> wrote in
    https://stackoverflow.com/questions/38948141/how-are-linux-shells-and-filesystem-unicode-aware
    I don't see your point. Could I ask you to elaborate?

    There's no standardized charset for Unix filesystems beyond 7 bit ASCII.
    If you store chars >= 128 in one application they may become different
    chars in another.

    Why are you repeatedly saying that? It's not true, and examples have
    been provided. If applications are locale-aware - which has been
    standard for a long time - you can consistently use what you like.
    Maybe you don't know, maybe you mean something different; in any case,
    please provide some evidence instead of repeating opinions so that we
    can discuss that and see where any misunderstandings are or where we
    are just talking at cross purposes.

    Janis

  • From Janis Papanagnou@21:1/5 to Bonita Montero on Mon Apr 28 10:08:15 2025
    On 28.04.2025 09:44, Bonita Montero wrote:
    Am 28.04.2025 um 09:42 schrieb Janis Papanagnou:

    Why are you repeatedly saying that; it's not true, and examples have
    been provided. If applications are locale-aware - which is standard
    for a long time - you can consistently use what you like.

    There's no standard locale for a filesystem.

    My file system (and obviously also the file systems of others that
    are posting here) have no problems with any locale.

    The historic architecture of Linux file systems is able to represent
    files having file names in arbitrary languages. That's why the Unix
    file systems don't show the issues that other (popular) OSes show.

    Generally, and specifically if you choose to use international
    characters for file names, the prevalent and nowadays de facto
    standard is to use a UTF-8 encoding.

    Linux sucks with that.

    Okay, noted; you repeat opinions and skip and snip and ignore what
    had been already said and shown. (I think it makes no sense trying
    to continue a serious discussion with you.)

    BTW, the Unix example I posted this morning ("Ölüberschuß.txt") was
    done on a Linux system.

    Janis

  • From vallor@21:1/5 to All on Mon Apr 28 09:04:41 2025
    On Mon, 28 Apr 2025 01:21:39 -0000 (UTC), vallor <vallor@cultnix.org>
    wrote in <vuml73$1riea$1@dont-email.me>:

    ObC (What did I mess up here?):

    Answer (I think?): the %x printf conversion expects an unsigned int.

    $ cat readit.c
    #include <stdio.h>
    #include <sys/types.h>
    #include <dirent.h>
    #include <string.h>

    int main(void)
    {
    DIR * this = {0};
    struct dirent * entry = {0};
    char * s;
    int dc = 0; // display character

    this = opendir(".");
    while ((entry = readdir(this))!=NULL)
    {
    if(!strcmp(entry->d_name,".")) continue;
    if(!strcmp(entry->d_name,"..")) continue;
    for(s = entry->d_name; *s ; s++)
    {

    dc = (*s<0) ? 256+*s : *s;

    printf("%x\n",dc);

    }
    puts("---");
    }

    return 0;
    }
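
    An alternative to the 256+*s adjustment is to convert through
    unsigned char before printing, which also works where plain char is
    signed. The helper name byte_to_hex below is hypothetical; this is a
    sketch of the idea, not a replacement for the program above.

```c
#include <stdio.h>

/* Format one byte of a char string as two lowercase hex digits.
 * 'out' must have room for 3 chars (two digits and the terminator).
 * The cast to unsigned char avoids sign extension when char is signed,
 * so 0xe8 is printed as "e8" and not as "ffffffe8". */
static void byte_to_hex(char c, char out[3])
{
    snprintf(out, 3, "%02x", (unsigned)(unsigned char)c);
}
```

    In the loop above this would correspond to
    printf("%x\n", (unsigned)(unsigned char)*s);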

    --
    -v

  • From Muttley@DastardlyHQ.org@21:1/5 to All on Mon Apr 28 09:31:37 2025
    On Mon, 28 Apr 2025 09:44:27 +0200
    Bonita Montero <Bonita.Montero@gmail.com> wibbled:
    Am 28.04.2025 um 09:42 schrieb Janis Papanagnou:

    Why are you repeatedly saying that; it's not true, and examples have
    been provided. If applications are locale-aware - which is standard
    for a long time - you can consistently use what you like.

    There's no standard locale for a filesystem. Linux sucks with that.

    *nix doesn't care about locales for most things including filenames; it's
    all just a sequence of bytes. Locales only matter for display such as
    terminal char sets and dates.

  • From Muttley@DastardlyHQ.org@21:1/5 to All on Mon Apr 28 11:01:08 2025
    On Mon, 28 Apr 2025 11:39:26 +0200
    Bonita Montero <Bonita.Montero@gmail.com> wibbled:
    Am 28.04.2025 um 11:31 schrieb Muttley@DastardlyHQ.org:

    *nix doesn't care about locales for most things including filenames; it's
    all just a sequence of bytes. Locales only matter for display such as
    terminal char sets and dates.

    Yes, Unix APIs are really archaic. When you have a filename written

    I'd say logical. Why should the OS give a damn what locale the user is
    using, and hence the filename, any more than it should care about what's
    inside the file?

    with one user's locale and another user with a different locale reads
    it, he gets at most a partially readable filename. For Janis
    this seems to be flexibility, but for me that's a problem. A
    filesystem should have a fixed charset, at best Unicode.

    How often would there be users using different locales on the same
    machine? They'll be using whatever locale the institution that owns the
    machine uses, and on their own machines it's not a relevant question.

  • From Muttley@DastardlyHQ.org@21:1/5 to All on Mon Apr 28 12:01:34 2025
    On Mon, 28 Apr 2025 13:30:17 +0200
    Bonita Montero <Bonita.Montero@gmail.com> wibbled:
    Am 28.04.2025 um 13:01 schrieb Muttley@DastardlyHQ.org:

    I'd say logical. Why should the OS give a damn what locale the user is
    using and hence the filename any more than it should care about what's
    inside the file?

    To have filenames displayed the same way no matter what locale is
    currently configured.

    Who cares? If someone is that bothered, stick to 7-bit ASCII. The locale
    has nothing to do with the OS; it's an application and library concern.
    Why should the OS - for example - waste its time verifying that the
    filename only uses valid codes for the locale, and what if someone unzips
    something that contains filenames with a locale it doesn't support?
    Refuse to store the file? What a load of BS.

    Just allow a stream of bytes - excluding some forbidden chars such as
    "/" - and leave it at that.
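
    The "stream of bytes minus forbidden chars" rule can be written down
    in a few lines of C. This is a sketch of the rule as stated in this
    thread, with the hypothetical name is_valid_component; it assumes a
    traditional Unix-style filesystem, ignores length limits such as
    NAME_MAX, and does not cover filesystems that impose their own
    encoding rules.

```c
#include <string.h>

/* Nonzero if 'name' is acceptable as a single path component under the
 * "just bytes" rule: non-empty and free of '/'. The NUL byte cannot
 * occur inside a C string, so it is excluded implicitly. No encoding or
 * locale check is made on purpose. */
static int is_valid_component(const char *name)
{
    return name[0] != '\0' && strchr(name, '/') == NULL;
}
```

    Under this rule the UTF-8 bytes of 調和 are accepted, and so is a
    byte sequence such as "\xff\xfe" that is not valid UTF-8 at all.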

    Similarly does the OS care what locale DNS names are in? No.

    How often would there be users using different locales on the same machine?

    With Unix there's no locale defined for filesystem operations; it's
    arbitrary.

    It's not arbitrary - there's no locale, end of.

  • From Scott Lurndal@21:1/5 to Bonita Montero on Mon Apr 28 14:21:43 2025
    Bonita Montero <Bonita.Montero@gmail.com> writes:
    Am 28.04.2025 um 10:08 schrieb Janis Papanagnou:

    My file system (and obviously also the file systems of others that
    are posting here) have no problems with any locale.

    That's the problem: the filesystem should have a specific locale.
    Otherwise you copy some files from a different computer where the
    user has a different locale and you get Swahili-filenames.

    nonsense.

  • From Scott Lurndal@21:1/5 to Bonita Montero on Mon Apr 28 14:24:27 2025
    Bonita Montero <Bonita.Montero@gmail.com> writes:
    Am 28.04.2025 um 11:39 schrieb Bonita Montero:

    Am 28.04.2025 um 11:31 schrieb Muttley@DastardlyHQ.org:

    Yes, Unix APIs are really archaic. When you have a filename written
    with one user's locale and another user with a different locale reads
    it, he gets at most a partially readable filename. For Janis
    this seems to be flexibility, but for me that's a problem. A
    filesystem should have a fixed charset, at best Unicode.

    I did have a look at how macOS / APFS handles this:
    for macOS all filenames are UTF-8.

    No, unix (and macOS _is_ unix) filenames are a simple stream of
    bytes with no meaning or semantic associated with the bytes other
    than the terminating nul character and the directory separator
    character.

    Far more flexible than NTFS.

  • From Scott Lurndal@21:1/5 to Bonita Montero on Mon Apr 28 14:30:40 2025
    Bonita Montero <Bonita.Montero@gmail.com> writes:
    Am 28.04.2025 um 13:30 schrieb Bonita Montero:

    How often would there be users using different locales on the same
    machine?

    With Unix there's no locale defined for filesystem operations; it's
    arbitrary.

    And imagine that you have a tar-archive packed by someone with a
    different locale; that's rather likely.

    The data in the tar archive is locale-independent. Always.
    The file name stored in the archive is a stream of bytes,
    it's always the same regardless of the active locale or
    the current font.

  • From Scott Lurndal@21:1/5 to Bonita Montero on Mon Apr 28 16:59:35 2025
    Bonita Montero <Bonita.Montero@gmail.com> writes:
    Am 28.04.2025 um 16:21 schrieb Scott Lurndal:
    Bonita Montero <Bonita.Montero@gmail.com> writes:
    Am 28.04.2025 um 10:08 schrieb Janis Papanagnou:

    My file system (and obviously also the file systems of others that
    are posting here) have no problems with any locale.

    That's the problem: the filesystem should have a specific locale.
    Otherwise you copy some files from a different computer where the
    user has a different locale and you get Swahili-filenames.

    nonsense.

    No nonsense. If you create some files with extended chars and pack
    them into a tar-file and unpack them on a different machine with
    a different locale you see the wrong characters.


    Not really. UTF-8 is UTF-8, regardless of the locale.

  • From Scott Lurndal@21:1/5 to Bonita Montero on Mon Apr 28 17:03:46 2025
    Bonita Montero <Bonita.Montero@gmail.com> writes:
    Am 28.04.2025 um 16:24 schrieb Scott Lurndal:
    Bonita Montero <Bonita.Montero@gmail.com> writes:
    Am 28.04.2025 um 11:39 schrieb Bonita Montero:

    Am 28.04.2025 um 11:31 schrieb Muttley@DastardlyHQ.org:

    Yes, Unix APIs are really archaic. When you have a filename written
    with one user's locale and another user with a different locale reads
    it, he gets at most a partially readable filename. For Janis
    this seems to be flexibility, but for me that's a problem. A
    filesystem should have a fixed charset, at best Unicode.

    I did have a look at how macOS / APFS handles this:
    for macOS all filenames are UTF-8.

    No, unix (and macOS _is_ unix) filenames are a simple stream of
    bytes with no meaning or semantic associated with the bytes other
    than the terminating nul character and the directory separator
    character.

    The Wikipedia says that APFS is UTF-8 capable.
    https://en.wikipedia.org/wiki/Apple_File_System

    So is linux. The operating system ascribes no meaning to the bytes
    stored in the filesystem directories. They're just a stream of bytes.

    One can treat them as UTF-8, which is generally the case. In which case
    your objections about 'garbage' in a different locale are pointless.
    UTF-8 fonts are universal. The current locale doesn't matter.

    Windows, on the other hand, limits the character set to those that can
    be described in 16-bit units, and the "locale" matters for not only
    display purposes, but also for character processing.

  • From Scott Lurndal@21:1/5 to Bonita Montero on Mon Apr 28 17:05:12 2025
    Bonita Montero <Bonita.Montero@gmail.com> writes:
    Am 28.04.2025 um 16:30 schrieb Scott Lurndal:
    Bonita Montero <Bonita.Montero@gmail.com> writes:
    Am 28.04.2025 um 13:30 schrieb Bonita Montero:

    How often would there be users using different locales on the same
    machine?

    With Unix there's no locale defined for filesystem operations; it's
    arbitrary.

    And imagine that you have a tar-archive packed by someone with a
    different locale; that's rather likely.

    The data in the tar archive is locale-independent. Always.

    The filenames not if they contain characters >= 128.

    You appear to be confused. The mapping of bytes into
    display characters is completely locale independent. Particularly
    for UTF-8 encodings.

  • From Michael S@21:1/5 to Scott Lurndal on Mon Apr 28 20:36:34 2025
    On Mon, 28 Apr 2025 17:03:46 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Bonita Montero <Bonita.Montero@gmail.com> writes:
    Am 28.04.2025 um 16:24 schrieb Scott Lurndal:
    Bonita Montero <Bonita.Montero@gmail.com> writes:
    Am 28.04.2025 um 11:39 schrieb Bonita Montero:

    Am 28.04.2025 um 11:31 schrieb Muttley@DastardlyHQ.org:

    Yes, Unix APIs are really archaic. When you have a filename
    written with one user's locale and another user with a different
    locale reads it, he gets at most a partially readable
    filename. For Janis this seems to be flexibility, but for me
    that's a problem. A filesystem should have a fixed charset, at
    best Unicode.

    I did have a look at how macOS / APFS handles this:
    for macOS all filenames are UTF-8.

    No, unix (and macOS _is_ unix) filenames are a simple stream of
    bytes with no meaning or semantic associated with the bytes other
    than the terminating nul character and the directory separator
    character.

    The Wikipedia says that APFS is UTF-8 capable.
    https://en.wikipedia.org/wiki/Apple_File_System

    So is linux. The operating system ascribes no meaning to the bytes
    stored in the filesystem directories. They're just a stream of
    bytes.


    That's nonsense.
    Every case-preserving case-insensitive file system has to understand
    characters encoding, at least to a certain degree.
    Apple file systems can be configured to be case-sensitive, but it's not
    default and recommended for non-specialist users.

    One can treat them as UTF-8, which is generally the case. In which
    case your objections about 'garbage' in a different locale are
    pointless. UTF-8 fonts are universal. The current locale doesn't
    matter.

    Windows, on the other hand, limits the character set to those that can
    be described in 16-bit units, and the "locale" matters for not only
    display purposes, but also for character processing.

    It's rather hard to understand what you mean by the above sentence.
    If you meant to say that Windows file names have to use only
    characters that were present in [mostly forgotten] UTC-2 character set
    then you are mistaken.
    If you meant something else then please express yourself more clearly.
    If it was your usual instinctive Windows bashing then don't bother.

  • From Scott Lurndal@21:1/5 to Michael S on Mon Apr 28 17:59:40 2025
    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 28 Apr 2025 17:03:46 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Bonita Montero <Bonita.Montero@gmail.com> writes:
    Am 28.04.2025 um 16:24 schrieb Scott Lurndal:
    Bonita Montero <Bonita.Montero@gmail.com> writes:
    Am 28.04.2025 um 11:39 schrieb Bonita Montero:

    Am 28.04.2025 um 11:31 schrieb Muttley@DastardlyHQ.org:

    Yes, Unix APIs are really archaic. When a filename is
    written with one user's locale and another user with a different
    locale reads it, he gets at most a partially readable
    filename. For Janis this seems to be flexibility, but for me
    that's a problem. A filesystem should have a fixed charset, at
    best Unicode.

    I did have a look at how macOS / APFS handles this:
    for macOS all filenames are UTF-8.

    No, unix (and macOS _is_ unix) filenames are a simple stream of
    bytes with no meaning or semantic associated with the bytes other
    than the terminating nul character and the directory separator
    character.

    The Wikipedia says that APFS is UTF-8 capable.
    https://en.wikipedia.org/wiki/Apple_File_System

    So is linux. The operating system ascribes no meaning to the bytes
    stored in the filesystem directories. They're just a stream of
    bytes.


    That's nonsense.
    Every case-preserving case-insensitive file system has to understand
    characters encoding, at least to a certain degree.

    The only case-insensitive Linux file systems (what a monumentally bad
    idea!) are NTFS and VFAT - neither of which is used for Linux except in
    very limited use-cases.

  • From Janis Papanagnou@21:1/5 to Bonita Montero on Mon Apr 28 20:05:18 2025
    On 28.04.2025 11:10, Bonita Montero wrote:
    Am 28.04.2025 um 10:08 schrieb Janis Papanagnou:

    My file system (and obviously also the file systems of others that
    are posting here) have no problems with any locale.

    That's the problem: the filesystem should have a specific locale.
    Otherwise you copy some files from a different computer where the
    user has a different locale and you get Swahili-filenames.

    Okay, I think I see where you're coming from. It reminds me of my former
    view on the file names topic; some decades ago I argued that it might
    be good to have only ASCII texts allowed for file names, specifically
    no control characters (and maybe even some more characters), to avoid
    some common issues with such characters. Needless to say that with
    such a "standard" we wouldn't have been able to support I18N. So some
    decades ago I changed my opinion on that. (Note that I was not saying
    that this is the same as your opinion, but there are similarities; to
    have well-defined "transfer syntax" including the character set.)

    The historic architecture of Linux file systems is able to represent
    files having file names in arbitrary languages. That's why the Unix
    file systems don't show the issues that other (popular) OSes show.

    Windows only has UTF-16 filenames and no varying locale.

    (I thought Windows would use "UCS2". Anyway; would 16 bit suffice to
    support full Unicode; I thought it wouldn't, or only old restricted
    versions of Unicode.)

    But first let's speak about [character] "encodings" (not "locales");
    I re-insert the snipped paragraph.

    Generally, and specifically if you choose to use international
    characters for file names, the prevalent and nowadays the de facto
    standard is to use a UTF-8 encoding.

    Above you said:
    Otherwise you copy some files from a different computer where the
    user has a different locale and you get Swahili-filenames.

    Interoperability requires standards, also in the character encoding.
    Or else some conversion will be necessary. Nowadays the most common
    and most widely used encoding standard seems to be UTF-8 (not UCS2
    and not UTF-16). In cases where you exchange data with systems that
    do not use that de facto standard you have to convert the data. And
    there are tools to do that for you, like 'iconv'.
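
    The conversion step described above can be sketched in a few lines. This
    is only an illustration in Python, not what 'iconv' does internally; the
    file name is made up, and ISO-8859-1 is merely assumed as the "other"
    system's encoding.

```python
# Hypothetical file name as a Latin-1 (ISO-8859-1) system would store it.
latin1_name = "Gr\u00fc\u00dfe.txt".encode("iso-8859-1")

# Convert: decode with the source encoding, re-encode with the target one.
utf8_name = latin1_name.decode("iso-8859-1").encode("utf-8")

print(latin1_name)  # b'Gr\xfc\xdfe.txt'
print(utf8_name)    # b'Gr\xc3\xbc\xc3\x9fe.txt'

# Without conversion the Latin-1 bytes are not even valid UTF-8,
# which is why the name shows up garbled on a UTF-8 system.
try:
    latin1_name.decode("utf-8")
    valid_as_utf8 = True
except UnicodeDecodeError:
    valid_as_utf8 = False

print(valid_as_utf8)  # False
```

    On the command line the equivalent would be along the lines of
    'iconv -f ISO-8859-1 -t UTF-8', applied to the data in question.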

    A coupling of the file system with a fixed character encoding would
    have prevented I18N, as I said above, but it's also not necessary to
    couple those.

    As long as Windows continues using its own "standards" I understand
    that some [Windows-]folks are angrily cussing systems that rely on
    prevalent standards.

    (So we can agree to disagree on the file system and encoding topic.)

    Janis

  • From Kaz Kylheku@21:1/5 to Janis Papanagnou on Mon Apr 28 18:29:45 2025
    On 2025-04-28, Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    On 28.04.2025 11:10, Bonita Montero wrote:
    Am 28.04.2025 um 10:08 schrieb Janis Papanagnou:

    My file system (and obviously also the file systems of others that
    are posting here) have no problems with any locale.

    That's the problem: the filesystem should have a specific locale.
    Otherwise you copy some files from a different computer where the
    user has a different locale and you get Swahili-filenames.

    Okay, I think I see where you're coming from.

    A plywood shack under the bridge.

  • From Kaz Kylheku@21:1/5 to Michael S on Mon Apr 28 18:28:46 2025
    On 2025-04-28, Michael S <already5chosen@yahoo.com> wrote:
    On Mon, 28 Apr 2025 17:03:46 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Bonita Montero <Bonita.Montero@gmail.com> writes:
    Am 28.04.2025 um 16:24 schrieb Scott Lurndal:
    Bonita Montero <Bonita.Montero@gmail.com> writes:
    Am 28.04.2025 um 11:39 schrieb Bonita Montero:

    Am 28.04.2025 um 11:31 schrieb Muttley@DastardlyHQ.org:

    Yes, Unix APIs are really archaic. When a filename is
    written with one user's locale and another user with a different
    locale reads it, he gets at most a partially readable
    filename. For Janis this seems to be flexibility, but for me
    that's a problem. A filesystem should have a fixed charset, at
    best Unicode.

    I did have a look at how macOS / APFS handles this:
    for macOS all filenames are UTF-8.

    No, unix (and macOS _is_ unix) filenames are a simple stream of
    bytes with no meaning or semantic associated with the bytes other
    than the terminating nul character and the directory separator
    character.

    The Wikipedia says that APFS is UTF-8 capable.
    https://en.wikipedia.org/wiki/Apple_File_System

    So is linux. The operating system ascribes no meaning to the bytes
    stored in the filesystem directories. They're just a stream of
    bytes.


    That's nonsense.
    Every case-preserving case-insensitive file system has to understand characters encoding, at least to a certain degree.

    Commonly used filesystems on Linux are case sensitive.

    Apple file systems can be configured to be case-sensitive, but it's not
    default and recommended for non-specialist users.

    It's a really, really idiotic thing that has become more so in
    the Unicode age.

    Why is there case insensitivity at all? Because some people
    can't handle 'a' and 'A' being different.

    But in Unicode, there are multiple ways of encoding characters
    such that the result looks exactly the same.

    If you're going to treat 'a' and 'A' the same, you must
    also treat multiple ways of encoding the same character
    as the same: whether it is multiple code points that combine
    as a grapheme cluster, or a dedicated code point or whatever.

    If you treat 'a' and 'A' as the same, but different grapheme
    clusters denoting the same character as different, you have
    a gaping inconsistency.

    To correctly do the idiotic thing you're doing, you have
    to bake knowledge of all Unicode into your namei() routine,
    along with megabytes of data to support it.

    Or you could just compare byte strings and punt that to the user as
    their problem if they write the same name in two ways.
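
    To make the point concrete, here is a small sketch (Python, standard
    library only) of one character written two ways. The example strings are
    chosen purely for illustration; it does not show how any particular
    filesystem implements its lookup.

```python
import unicodedata

precomposed = "\u00e9"   # 'e with acute' as a single code point
combining = "e\u0301"    # 'e' followed by COMBINING ACUTE ACCENT

# What a byte-comparing filesystem sees: two different names.
bytes_equal = precomposed.encode("utf-8") == combining.encode("utf-8")

# Only after Unicode normalization (here NFC) do they compare equal.
nfc_equal = (unicodedata.normalize("NFC", precomposed)
             == unicodedata.normalize("NFC", combining))

# Case folding is a further, separate step on top of normalization.
fold_equal = "Foo.PNG".casefold() == "foo.png".casefold()

print(bytes_equal, nfc_equal, fold_equal)  # False True True
```

    A lookup routine that wants 'a' and 'A' to match consistently would need
    both steps, and the tables behind them, for every name comparison.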

  • From Richard Harnden@21:1/5 to Bonita Montero on Mon Apr 28 19:47:33 2025
    On 28/04/2025 19:36, Bonita Montero wrote:
    Am 28.04.2025 um 18:59 schrieb Scott Lurndal:

    Not really.  UTF-8 is UTF-8, regardless of the locale.

    But UTF-8 isn't the standard locale for Unix filesystems
    except with macOS.


    UTF-8 isn't a locale - it's an encoding.

  • From Michael S@21:1/5 to Kaz Kylheku on Mon Apr 28 22:29:14 2025
    On Mon, 28 Apr 2025 18:28:46 -0000 (UTC)
    Kaz Kylheku <643-408-1753@kylheku.com> wrote:

    On 2025-04-28, Michael S <already5chosen@yahoo.com> wrote:

    That's nonsense.
    Every case-preserving case-insensitive file system has to understand characters encoding, at least to a certain degree.

    Commonly used filesystems on Linux are case sensitive.


    Scott's claim was about *all* Unixes and in his previous message he
    emphasized that he classifies Apple OS/X as Unix.

  • From Kaz Kylheku@21:1/5 to Michael S on Mon Apr 28 20:12:21 2025
    On 2025-04-28, Michael S <already5chosen@yahoo.com> wrote:
    On Mon, 28 Apr 2025 18:28:46 -0000 (UTC)
    Kaz Kylheku <643-408-1753@kylheku.com> wrote:

    On 2025-04-28, Michael S <already5chosen@yahoo.com> wrote:

    That's nonsense.
    Every case-preserving case-insensitive file system has to understand
    characters encoding, at least to a certain degree.

    Commonly used filesystems on Linux are case sensitive.


    Scott's claim was about *all* Unixes and in his previous message he emphasized that he classifies Apple OS/X as Unix.

    Apple breaks Unix in some deranged attempt to second-guess
    what is good for "the rest of them".

    This issue is no longer relevant to users in 2025. Users do not
    type entire filenames from scratch in order to open files. They
    click on things, or possibly use search/completion.

    It is the application level file search features that need to pander to
    issues like case insensitivity.

    Case sensitivity means that a user could create a file "Foo.png" and
    another "foo.PNG", and then have them appear in the folder side by side.
    Well, big whoopty whoop. They can double click on either of them to see
    what it is. They are not gonna type the name to do that! If they don't
    like it that there are two files with similar names they can click on
    either name, and change it.

    Now when the user searches for "foo", or for PNG files the finder should
    find both files. But that doesn't require the filesystem to be case
    insensitive, only the search logic in the finder. That's where you can
    go to town with supporting Unicode normalization and whatnot.


    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

  • From Michael S@21:1/5 to Michael S on Mon Apr 28 22:23:33 2025
    On Mon, 28 Apr 2025 20:36:34 +0300
    Michael S <already5chosen@yahoo.com> wrote:

    On Mon, 28 Apr 2025 17:03:46 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Bonita Montero <Bonita.Montero@gmail.com> writes:
    Am 28.04.2025 um 16:24 schrieb Scott Lurndal:
    Bonita Montero <Bonita.Montero@gmail.com> writes:
    Am 28.04.2025 um 11:39 schrieb Bonita Montero:

    Am 28.04.2025 um 11:31 schrieb Muttley@DastardlyHQ.org:

    Yes, Unix APIs are really archaic. When a filename is
    written with one user's locale and another user with a different
    locale reads it, he gets at most a partially readable
    filename. For Janis this seems to be flexibility, but for me
    that's a problem. A filesystem should have a fixed charset, at
    best Unicode.

    I did have a look at how macOS / APFS handles this:
    for macOS all filenames are UTF-8.

    No, unix (and macOS _is_ unix) filenames are a simple stream of
    bytes with no meaning or semantic associated with the bytes other
    than the terminating nul character and the directory separator
    character.

    The Wikipedia says that APFS is UTF-8 capable.
    https://en.wikipedia.org/wiki/Apple_File_System

    So is linux. The operating system ascribes no meaning to the bytes
    stored in the filesystem directories. They're just a stream of
    bytes.


    That's nonsense.
    Every case-preserving case-insensitive file system has to understand characters encoding, at least to a certain degree.
    Apple file systems can be configured to be case-sensitive, but it's
    not default and recommended for non-specialist users.


    Please read "and not recommended for"

    One can treat them as UTF-8, which is generally the case. In which
    case your objections about 'garbage' in a different locale are
    pointless. UTF-8 fonts are universal. The current locale doesn't
    matter.

    Windows, on the other hand, limits the character set to those that
    can be described in 16-bit units, and the "locale" matters for not
    only display purposes, but also for character processing.

    It's rather hard to understand what you mean by above sentence.
    If you meant to say that Windows file names have to use only
    characters that were present in [mostly forgotten] UTC-2 character set
    then you are mistaken.

    Please read "UCS-2"

    If you meant something else then please express yourself more clearly.
    If it was your usual instinctive Windows bashing then don't bother.




  • From Janis Papanagnou@21:1/5 to Bonita Montero on Tue Apr 29 01:13:58 2025
    On 28.04.2025 20:38, Bonita Montero wrote:
    Am 28.04.2025 um 20:05 schrieb Janis Papanagnou:

    (I thought Windows would use "UCS2". Anyway; would 16 bit suffice to
    support full Unicode; I thought it wouldn't, or only old restricted
    versions of Unicode.)

    Windows is UTF-16 since Windows 2000, UCS2 before.

    Oh, so it's a multi-word encoding (like UTF-8 is a multi-byte encoding)
    and a character not necessarily encoded with only one 16 bit word... -
    ...but then I wonder even more where you see an advantage.
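
    A quick way to see the "multi-word" property is to count 16-bit units.
    This is a sketch in Python; the helper name utf16_units and the two
    sample characters are chosen here only for illustration.

```python
def utf16_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

bmp_char = "\u20ac"         # EURO SIGN, inside the 16-bit range
astral_char = "\U0001F600"  # a code point above U+FFFF

print([hex(u) for u in utf16_units(bmp_char)])     # ['0x20ac']
print([hex(u) for u in utf16_units(astral_char)])  # ['0xd83d', '0xde00']

# For comparison, the same characters in UTF-8 take 3 and 4 bytes.
print(len(bmp_char.encode("utf-8")), len(astral_char.encode("utf-8")))  # 3 4
```

    The second character needs a surrogate pair, so code that assumes one
    16-bit unit per character breaks in the same way byte-per-character code
    breaks on UTF-8.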

    Janis

  • From Richard Heathfield@21:1/5 to Bonita Montero on Tue Apr 29 00:24:45 2025
    On 28/04/2025 22:26, Bonita Montero wrote:
    Am 28.04.2025 um 20:47 schrieb Richard Harnden:
    On 28/04/2025 19:36, Bonita Montero wrote:
    Am 28.04.2025 um 18:59 schrieb Scott Lurndal:

    Not really.  UTF-8 is UTF-8, regardless of the locale.

    But UTF-8 isn't the standard locale for Unix filesystems
    except with macOS.


    UTF-8 isn't a locale - it's an encoding.

    Idiot.
    Type "locale" in the shell and then press return.

    As David knows and you apparently don't, UTF-8 is an encoding,
    not a locale.

    If you must call people idiots, it's probably wisest to make sure
    first that you're on solid ground.

    --
    Richard Heathfield
    Email: rjh at cpax dot org dot uk
    "Usenet is a strange place" - dmr 29 July 1999
    Sig line 4 vacant - apply within

  • From James Kuyper@21:1/5 to Bonita Montero on Mon Apr 28 20:50:18 2025
    On 28/04/2025 22:26, Bonita Montero wrote:
    Am 28.04.2025 um 20:47 schrieb Richard Harnden:
    ...
    UTF-8 isn't a locale - it's an encoding.

    Idiot.
    Type "locale" in the shell and then press return.

    On my system, that produces:

    LANG=en_US.UTF-8
    LANGUAGE=en_US:en
    LC_CTYPE="en_US.UTF-8"
    LC_NUMERIC=en_US.UTF-8
    LC_TIME=en_US.UTF-8
    LC_COLLATE="en_US.UTF-8"
    LC_MONETARY=en_US.UTF-8
    LC_MESSAGES="en_US.UTF-8"
    LC_PAPER=en_US.UTF-8
    LC_NAME=en_US.UTF-8
    LC_ADDRESS=en_US.UTF-8
    LC_TELEPHONE=en_US.UTF-8
    LC_MEASUREMENT=en_US.UTF-8
    LC_IDENTIFICATION=en_US.UTF-8

    Try "locale -a" for more complete information. On my system the result is:


    C
    C.utf8
    de_AT.utf8
    de_BE.utf8
    de_CH.utf8
    de_DE.utf8
    de_IT.utf8
    de_LI.utf8
    de_LU.utf8
    en_AG
    en_AG.utf8
    en_AU.utf8
    en_BW.utf8
    en_CA.utf8
    en_DK.utf8
    en_GB.utf8
    en_HK.utf8
    en_IE.utf8
    en_IL
    en_IL.utf8
    en_IN
    en_IN.utf8
    en_NG
    en_NG.utf8
    en_NZ.utf8
    en_PH.utf8
    en_SG.utf8
    en_US.utf8
    en_ZA.utf8
    en_ZM
    en_ZM.utf8
    en_ZW.utf8
    es_AR.utf8
    es_BO.utf8
    es_CL.utf8
    es_CO.utf8
    es_CR.utf8
    es_CU
    es_CU.utf8
    es_DO.utf8
    es_EC.utf8
    es_ES.utf8
    es_GT.utf8
    es_HN.utf8
    es_MX.utf8
    es_NI.utf8
    es_PA.utf8
    es_PE.utf8
    es_PR.utf8
    es_PY.utf8
    es_SV.utf8
    es_US.utf8
    es_UY.utf8
    es_VE.utf8
    POSIX
    ru_RU.utf8
    ru_UA.utf8
    uk_UA.utf8
    zh_HK.utf8
    zh_TW.utf8

    I've installed all of the languages that I have at least a passing
    acquaintance with. All of those locales with "utf8" at the end use the
    same encoding (UTF-8, oddly enough).
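
    The listings above follow the conventional naming scheme
    language[_territory][.codeset][@modifier], so the encoding is just one
    component of a locale name. A minimal sketch of pulling the parts apart
    (the function name split_locale_name is made up for this example, and it
    does no validation):

```python
def split_locale_name(name):
    """Split 'language[_territory][.codeset][@modifier]' into its parts."""
    base, _, modifier = name.partition("@")
    base, _, codeset = base.partition(".")
    language, _, territory = base.partition("_")
    return {
        "language": language,
        "territory": territory,
        "codeset": codeset,
        "modifier": modifier,
    }

for name in ("en_US.UTF-8", "de_AT.utf8", "ru_RU.utf8", "C"):
    print(name, split_locale_name(name))
```

    For "en_US.UTF-8" this yields language "en", territory "US" and codeset
    "UTF-8"; for "C" only the first field is filled in.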

  • From Scott Lurndal@21:1/5 to Bonita Montero on Tue Apr 29 00:28:17 2025
    Bonita Montero <Bonita.Montero@gmail.com> writes:
    Am 28.04.2025 um 20:47 schrieb Richard Harnden:
    On 28/04/2025 19:36, Bonita Montero wrote:
    Am 28.04.2025 um 18:59 schrieb Scott Lurndal:

    Not really.  UTF-8 is UTF-8, regardless of the locale.

    But UTF-8 isn't the standard locale for Unix filesystems
    except with macOS.


    UTF-8 isn't a locale - it's an encoding.

    Idiot.
    Type "locale" in the shell and then press return.

    $ locale
    LANG=C
    LC_CTYPE="C"
    LC_NUMERIC=C
    LC_TIME="C"
    LC_COLLATE="C"
    LC_MONETARY="C"
    LC_MESSAGES="C"
    LC_PAPER="C"
    LC_NAME="C"
    LC_ADDRESS="C"
    LC_TELEPHONE="C"
    LC_MEASUREMENT="C"
    LC_IDENTIFICATION="C"
    LC_ALL=

  • From Lawrence D'Oliveiro@21:1/5 to Bonita Montero on Tue Apr 29 02:37:56 2025
    On Mon, 28 Apr 2025 09:28:14 +0200, Bonita Montero wrote:

    Am 28.04.2025 um 06:47 schrieb Lawrence D'Oliveiro:

    On Mon, 28 Apr 2025 06:28:05 +0200, Bonita Montero wrote:

    Am 28.04.2025 um 03:22 schrieb Lawrence D'Oliveiro:

    That's not true. The codepoints for the surrogates were unused before.

    The problem is the fact that you have to deal with surrogates.

    That's trivial.

    I had to deal with it in Java code. It’s not trivial.

    Far easier to have systems, like Python or Linux, which can deal with full Unicode in a more native fashion.

  • From Lawrence D'Oliveiro@21:1/5 to Bonita Montero on Tue Apr 29 02:39:19 2025
    On Mon, 28 Apr 2025 09:44:27 +0200, Bonita Montero wrote:

    There's no standard locale for a filesystem. Linux sucks with that.

    Is your mission in life to try to make Windows look better than Linux?

    You realize that’s futile, don’t you?

  • From Richard Heathfield@21:1/5 to Bonita Montero on Tue Apr 29 07:25:18 2025
    On 29/04/2025 06:36, Bonita Montero wrote:
    Am 29.04.2025 um 01:24 schrieb Richard Heathfield:
    On 28/04/2025 22:26, Bonita Montero wrote:
    Am 28.04.2025 um 20:47 schrieb Richard Harnden:
    On 28/04/2025 19:36, Bonita Montero wrote:
    Am 28.04.2025 um 18:59 schrieb Scott Lurndal:

    Not really.  UTF-8 is UTF-8, regardless of the locale.

    But UTF-8 isn't the standard locale for Unix filesystems
    except with macOS.


    UTF-8 isn't a locale - it's an encoding.

    Idiot.
    Type "locale" in the shell and then press return.

    As David knows and you apparently don't, UTF-8 is an encoding,
    not a locale.

    If you must call people idiots, it's probably wisest to make
    sure first that you're on solid ground.

    UTF-8 has a locale, the chars between 128 and 255 have the locale
    Latin 1.

    A dog has a tail, but that doesn't mean a tail is a dog. Whatever
    UTF-8 may or may not have, it's an encoding, not a locale.

    David is right, and you are mistaken. If you must call people
    idiots, it's probably wisest to make sure first that you're on
    solid ground.

    --
    Richard Heathfield
    Email: rjh at cpax dot org dot uk
    "Usenet is a strange place" - dmr 29 July 1999
    Sig line 4 vacant - apply within

  • From Lawrence D'Oliveiro@21:1/5 to Bonita Montero on Tue Apr 29 07:23:39 2025
    On Mon, 28 Apr 2025 20:39:15 +0200, Bonita Montero wrote:

    There are no locales with UTF-16.

    Locales have to do with more than character encoding.

    <https://sourceware.org/glibc/wiki/Locales>

  • From Lawrence D'Oliveiro@21:1/5 to Bonita Montero on Tue Apr 29 07:25:16 2025
    On Tue, 29 Apr 2025 07:37:58 +0200, Bonita Montero wrote:

    ... but the Win32-APIs are more mature than the Posix-APIs ...

    If only ...

    Note the limitations on Windows with <https://docs.python.org/3/library/select.html>, just for example.

  • From Lawrence D'Oliveiro@21:1/5 to Michael S on Tue Apr 29 07:26:33 2025
    On Mon, 28 Apr 2025 22:29:14 +0300, Michael S wrote:

    ... he classifies Apple OS/X as Unix.

    That is the only real “Unix” left. Linux is officially not “Unix”.

  • From Lawrence D'Oliveiro@21:1/5 to Bonita Montero on Tue Apr 29 07:28:09 2025
    On Mon, 28 Apr 2025 13:30:17 +0200, Bonita Montero wrote:

    With Unix there's no locale defined for filesystem operations; it's arbitrary.

    Don’t confuse “Unix” with “Linux”. On Linux, ASCII “/” is the pathname
    component separator, and ASCII NUL is the pathname terminator. Everything
    else is simply passed through as is. In particular, I can use “∕” as part of a pathname component, if I want.
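
    As a sketch of why that works (Python, standard library only; the sample
    path is invented): the look-alike character is U+2215 DIVISION SLASH, and
    its UTF-8 encoding contains neither the byte 0x2F nor a NUL byte, so path
    handling that only looks for those two bytes never notices it.

```python
import posixpath
import unicodedata

division_slash = "\u2215"
encoded = division_slash.encode("utf-8")

print(unicodedata.name(division_slash))  # DIVISION SLASH
print(encoded)                           # b'\xe2\x88\x95'

# Neither of the two special bytes occurs in the encoded form.
contains_separator = b"/" in encoded
contains_nul = b"\x00" in encoded

# POSIX-style path splitting treats it as an ordinary name character.
leaf = posixpath.basename("dir/a\u2215b")

print(contains_separator, contains_nul, leaf == "a\u2215b")  # False False True
```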

  • From Muttley@DastardlyHQ.org@21:1/5 to All on Tue Apr 29 07:40:33 2025
    On Mon, 28 Apr 2025 20:36:34 +0300
    Michael S <already5chosen@yahoo.com> wibbled:
    On Mon, 28 Apr 2025 17:03:46 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:
    So is linux. The operating system ascribes no meaning to the bytes
    stored in the filesystem directories. They're just a stream of
    bytes.


    That's nonsense.
    Every case-preserving case-insensitive file system has to understand
    characters encoding, at least to a certain degree.

    Case-insensitive file systems are an abortion that no sane OS should use.
    Far too much scope for mistakes not to mention the problem of unzipping
    from a case sensitive one.

  • From Lawrence D'Oliveiro@21:1/5 to Bonita Montero on Tue Apr 29 07:30:21 2025
    On Tue, 29 Apr 2025 07:40:57 +0200, Bonita Montero wrote:

    Am 29.04.2025 um 04:37 schrieb Lawrence D'Oliveiro:

    On Mon, 28 Apr 2025 09:28:14 +0200, Bonita Montero wrote:

    Am 28.04.2025 um 06:47 schrieb Lawrence D'Oliveiro:

    On Mon, 28 Apr 2025 06:28:05 +0200, Bonita Montero wrote:

    Am 28.04.2025 um 03:22 schrieb Lawrence D'Oliveiro:

    That's not true. The codepoints for the surrogates were unused
    before.

    The problem is the fact that you have to deal with surrogates.

    That's trivial.

    I had to deal with it in Java code. It’s not trivial.

    Far easier to have systems, like Python or Linux, which can deal with
    full Unicode in a more native fashion.

    I've got my u16_feeder iterator for that, which I've been using for years.
    It has the same semantics as a pointer in C++ and it's as easy to use.

    So you had to roll your own code to deal with it. That’s OK. On POSIX systems, we have iconv(3) when we need it.

    <https://manpages.debian.org/iconv(3)>

  • From Janis Papanagnou@21:1/5 to Lawrence D'Oliveiro on Tue Apr 29 09:47:40 2025
    On 29.04.2025 09:26, Lawrence D'Oliveiro wrote:
    On Mon, 28 Apr 2025 22:29:14 +0300, Michael S wrote:

    ... he classifies Apple OS/X as Unix.

    That is the only real “Unix” left. Linux is officially not “Unix”.

    I'm not sure what sort of "officially" you have in mind.

    As opposed to UNIX, a trademark and originally identifying the AT&T
    version of a Unix system, the term Unix is usually used to classify
    the _family_ of these operating systems. But MacOS X is in the line
    of BSD Unixes (not AT&T). So both, Linux and MacOS X are Unixes.

    Janis

  • From Janis Papanagnou@21:1/5 to Lawrence D'Oliveiro on Tue Apr 29 09:51:00 2025
    On 29.04.2025 09:28, Lawrence D'Oliveiro wrote:
    On Mon, 28 Apr 2025 13:30:17 +0200, Bonita Montero wrote:

    With Unix there's no locale defined for filesystem operations; it's
    arbitrary.

    Don’t confuse “Unix” with “Linux”. On Linux, ASCII “/” is the pathname
    component separator, and ASCII NUL is the pathname terminator. Everything
    else is simply passed through as is. In particular, I can use “∕” as part
    of a pathname component, if I want.

    If I'm not mistaken these two characters have been used in _all_ Unix
    systems (UNIX, BSD, ...) as special file system characters.

    Janis

  • From Michael S@21:1/5 to Janis Papanagnou on Tue Apr 29 11:17:50 2025
    On Tue, 29 Apr 2025 09:47:40 +0200
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

    On 29.04.2025 09:26, Lawrence D'Oliveiro wrote:
    On Mon, 28 Apr 2025 22:29:14 +0300, Michael S wrote:

    ... he classifies Apple OS/X as Unix.

    That is the only real “Unix” left. Linux is officially not “Unix”.

    I'm not sure what sort of "officially" you have in mind.


    The only meaning of "officially" I can think of is "licensed to use
    the UNIX trademark by its current owner, The Open Group".
    According to Wikipedia: "Systems that have been licensed to use the
    UNIX trademark include AIX,[44] EulerOS,[45] HP-UX,[46] Inspur
    K-UX,[47] IRIX,[48] macOS,[49] Solaris,[50] Tru64 UNIX (formerly
    "Digital UNIX", or OSF/1),[51] and z/OS".
    There are 2 Linux distros in this list - EulerOS and Inspur K-UX.

    Of the others on the list, IRIX and Tru64 are dead, HP-UX is almost dead,
    AIX and Solaris are alive, but considered legacy systems by their
    owners and see very little new development. z/OS is alive and in good
    shape, but everybody knows that despite the trademark it is not similar
    to Unix.

    As opposed to UNIX, a trademark and originally identifying the AT&T
    version of a Unix system, the term Unix is usually used to classify
    the _family_ of these operating systems. But MacOS X is in the line
    of BSD Unixes (not AT&T). So both, Linux and MacOS X are Unixes.

    Janis


  • From Michael S@21:1/5 to Scott Lurndal on Tue Apr 29 11:29:29 2025
    On Tue, 29 Apr 2025 00:28:17 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Bonita Montero <Bonita.Montero@gmail.com> writes:
    Am 28.04.2025 um 20:47 schrieb Richard Harnden:
    On 28/04/2025 19:36, Bonita Montero wrote:
    Am 28.04.2025 um 18:59 schrieb Scott Lurndal:

    Not really.  UTF-8 is UTF-8, regardless of the locale.

    But UTF-8 isn't the standard locale for Unix filesystems
    except with macOS.


    UTF-8 isn't a locale - it's an encoding.

    Idiot.
    Type "locale" in the shell and then press return.

    $ locale
    LANG=C
    LC_CTYPE="C"
    LC_NUMERIC=C
    LC_TIME="C"
    LC_COLLATE="C"
    LC_MONETARY="C"
    LC_MESSAGES="C"
    LC_PAPER="C"
    LC_NAME="C"
    LC_ADDRESS="C"
    LC_TELEPHONE="C"
    LC_MEASUREMENT="C"
    LC_IDENTIFICATION="C"
    LC_ALL=

    What do you see with 'locale -a' ?

  • From David Brown@21:1/5 to Richard Heathfield on Tue Apr 29 10:36:34 2025
    On 29/04/2025 01:24, Richard Heathfield wrote:
    On 28/04/2025 22:26, Bonita Montero wrote:
    Am 28.04.2025 um 20:47 schrieb Richard Harnden:
    On 28/04/2025 19:36, Bonita Montero wrote:
    Am 28.04.2025 um 18:59 schrieb Scott Lurndal:

    Not really.  UTF-8 is UTF-8, regardless of the locale.

    But UTF-8 isn't the standard locale for Unix filesystems
    except with macOS.


    UTF-8 isn't a locale - it's an encoding.

    Idiot.
    Type "locale" in the shell and then press return.

    As David knows and you apparently don't, UTF-8 is an encoding, not a
    locale.


    Which "David" are you referring to here? If it is me, then yes, I know
    UTF-8 is an encoding and not a locale. I even knew that before Richard
    Harnden wrote it in the post you quoted :-)


    But it is also true that the character encoding is often specified as
    part of the locale information. That is used when determining the codes
    for characters typed at the keyboard, and conversely for converting
    codes into characters for display. I think most people who use Linux
    and who use characters beyond ASCII will use a locale with UTF-8
    encoding. There might be some English-only speakers who use a locale
    with ISO-8859-1 or ISO-8859-15, as well as some legacy users of other
    encodings, some specialist users, and on servers the encoding is
    typically irrelevant.

    Filesystems on *nix, and most of the rest of *nix systems, do not care
    about encoding - they just deal with a string of bytes, which is
    obviously the correct way to handle them. Windows only needs
    locale-aware filesystems and filesystem APIs because MS made the mistake
    of using case-insensitive filesystems in their early days, and are stuck
    with that.


    If you must call people idiots, it's probably wisest to make sure first
    that you're on solid ground.


  • From David Brown@21:1/5 to Janis Papanagnou on Tue Apr 29 10:58:46 2025
    On 29/04/2025 01:13, Janis Papanagnou wrote:
    On 28.04.2025 20:38, Bonita Montero wrote:
    Am 28.04.2025 um 20:05 schrieb Janis Papanagnou:

    (I thought Windows would use "UCS2". Anyway; would 16 bit suffice to
    support full Unicode; I thought it wouldn't, or only old restricted
    versions of Unicode.)

    Windows is UTF-16 since Windows 2000, UCS2 before.


    No, Windows has had /some/ UTF-16 support since W2K, with gradual
    improvements over time to APIs, filesystems, and applications. Later
    on, it started getting /some/ UTF-8 support, which is a much better
    choice for most uses.

    Oh, so it's a multi-word encoding (like UTF-8 is a multi-byte encoding)
    and a character not necessarily encoded with only one 16 bit word... -
    ...but then I wonder even more where you see an advantage.


    When Unicode started, they thought 16 bits would be enough. UCS2 made
    sense then, because it was a fixed size encoding - though it had the
    huge disadvantages of being endian-dependent and totally incompatible
    with every existing character set. Early Unicode adopters included
    Windows NT and NTFS, Java, Qt and Python, using UCS2.

    Once Unicode was extended beyond 16 bits, three new encodings emerged -
    UCS4 (32-bit fixed size), UTF-8 and UTF-16. UCS4 has the advantage of
    being fixed size (that turns out to be a minor issue in practice, but
    was long thought to be important), but like UCS2 it suffers from
    endianness, and is inefficient in size (but is easily compressed, so
    that also does not matter as much as many people think). UCS4 covers
    all code points in Unicode, but combining characters mean it still does
    not cover all characters in one code unit.

    UTF-16 is variable length, and can encode any Unicode code point. It
    has the advantage that existing UCS2 is a subset, making it a natural
    extension for UCS2 systems - but keeps the same disadvantages of
    inefficiency for common ASCII characters, incompatibility with ASCII,
    and endian-dependence, and it requires dedicated functions for almost
    everything.
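
    To make the variable-length point concrete, here is a minimal sketch
    of UTF-16 encoding for a single code point. The function name
    utf16_encode is made up for this example; byte order (the endianness
    problem mentioned above) is deliberately left out, since the result
    is an array of 16-bit units rather than bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-16 code units.
 * Returns the number of units written (1 or 2), or 0 if cp is not a
 * valid scalar value (a surrogate, or above U+10FFFF). */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                     /* reserved for surrogates */
    if (cp > 0x10FFFF)
        return 0;                     /* beyond the Unicode range */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;        /* same value as in UCS2 */
        return 1;
    }
    cp -= 0x10000;                    /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

    Code points below U+10000 come out as one unit, identical to UCS2;
    everything above needs two, which is why code that assumes one unit
    per character breaks on UTF-16 data.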

    The biggest problem with UTF-16 IMHO is that it delayed adoption of
    UTF-8 in early Unicode software. Changing something like Qt or Windows
    from UCS2 to UTF-8 is not easy, but it would have been much better in
    the long run if that had been done without changing to UTF-16 first.

  • From Muttley@DastardlyHQ.org@21:1/5 to All on Tue Apr 29 10:12:41 2025
    On Tue, 29 Apr 2025 09:40:36 +0200
    Bonita Montero <Bonita.Montero@gmail.com> wibbled:
    Am 29.04.2025 um 09:25 schrieb Lawrence D'Oliveiro:

    If only ...
    Note the limitations on Windows with
    <https://docs.python.org/3/library/select.html>, just for example.

    Windows has I/O-completion ports which are more flexible than select().

    No API which requires a separate thread for each channel is "flexible".
    The whole point of select and poll is that they allow multiplexing in a
    single-threaded program. If you're going to use multiple threads you
    might as well just sit in a blocking read() or write(); you don't need
    a separate API at all.
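
    A minimal sketch of that single-threaded multiplexing with POSIX
    poll(), watching two descriptors at once. The helper name
    wait_readable and its return convention are invented for this
    example.

```c
#include <poll.h>

/* Wait up to timeout_ms milliseconds for either descriptor to become
 * readable. Returns the index (0 or 1) of the first readable one, or
 * -1 on timeout or error. No extra threads are involved. */
int wait_readable(int fd0, int fd1, int timeout_ms)
{
    struct pollfd fds[2] = {
        { .fd = fd0, .events = POLLIN },
        { .fd = fd1, .events = POLLIN },
    };

    if (poll(fds, 2, timeout_ms) <= 0)
        return -1;
    if (fds[0].revents & POLLIN)
        return 0;
    if (fds[1].revents & POLLIN)
        return 1;
    return -1;
}
```

    The caller then does an ordinary read() on whichever descriptor was
    reported, so one loop can service many channels.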

  • From Muttley@DastardlyHQ.org@21:1/5 to All on Tue Apr 29 10:17:12 2025
    On Tue, 29 Apr 2025 10:58:46 +0200
    David Brown <david.brown@hesbynett.no> wibbled:
    On 29/04/2025 01:13, Janis Papanagnou wrote:
    Oh, so it's a multi-word encoding (like UTF-8 is a multi-byte encoding)
    and a character not necessarily encoded with only one 16 bit word... -
    ...but then I wonder even more where you see an advantage.


    When Unicode started, they thought 16 bits would be enough. UCS2 made

    They only had to check how many Chinese pictograms there are to realise
    that it was never going to be enough. Perhaps Chinese wasn't considered
    important back then.

  • From Richard Heathfield@21:1/5 to David Brown on Tue Apr 29 12:23:09 2025
    On 29/04/2025 09:36, David Brown wrote:
    Which "David" are you referring to here?

    Would you believe Richard D Harnden?

    ... No, thought not. (Good spot.)

    --
    Richard Heathfield
    Email: rjh at cpax dot org dot uk
    "Usenet is a strange place" - dmr 29 July 1999
    Sig line 4 vacant - apply within

  • From Scott Lurndal@21:1/5 to Bonita Montero on Tue Apr 29 12:59:42 2025
    Bonita Montero <Bonita.Montero@gmail.com> writes:
    Am 29.04.2025 um 04:39 schrieb Lawrence D'Oliveiro:

    On Mon, 28 Apr 2025 09:44:27 +0200, Bonita Montero wrote:

    There's no standard locale for a filesystem. Linux sucks with that.

    Is your mission in life to try to make Windows look better than Linux?
    You realize that’s futile, don’t you?

    Not generally, but the Win32-APIs are more mature than the Posix-APIs,
    but the implementation is slower.

    More mature? You're joking, right? The POSIX API was mature when
    Windows 3.1 was in vogue.

  • From Scott Lurndal@21:1/5 to Bonita Montero on Tue Apr 29 12:57:34 2025
    Bonita Montero <Bonita.Montero@gmail.com> writes:
    Am 29.04.2025 um 02:28 schrieb Scott Lurndal:
    Bonita Montero <Bonita.Montero@gmail.com> writes:
    Am 28.04.2025 um 20:47 schrieb Richard Harnden:
    On 28/04/2025 19:36, Bonita Montero wrote:
    Am 28.04.2025 um 18:59 schrieb Scott Lurndal:

    Not really.  UTF-8 is UTF-8, regardless of the locale.

    But UTF-8 isn't the standard locale for Unix filesystems
    except with macOS.


    UTF-8 isn't a locale - it's an encoding.

    Idiot.
    Type "locale" in the shell and then return.

    $ locale
    LANG=C
    LC_CTYPE="C"
    LC_NUMERIC=C
    LC_TIME="C"
    LC_COLLATE="C"
    LC_MONETARY="C"
    LC_MESSAGES="C"
    LC_PAPER="C"
    LC_NAME="C"
    LC_ADDRESS="C"
    LC_TELEPHONE="C"
    LC_MEASUREMENT="C"
    LC_IDENTIFICATION="C"
    LC_ALL=

    For me:

    boni@Raubtier-Asyl:/mnt/c/Users/Boni$ locale
    LANG=C.UTF-8
    LANGUAGE=

    Same locale, different encoding. As has been pointed out
    to you repeatedly.

  • From David Brown@21:1/5 to Muttley@DastardlyHQ.org on Tue Apr 29 15:06:52 2025
    On 29/04/2025 12:17, Muttley@DastardlyHQ.org wrote:
    On Tue, 29 Apr 2025 10:58:46 +0200
    David Brown <david.brown@hesbynett.no> wibbled:
    On 29/04/2025 01:13, Janis Papanagnou wrote:
    Oh, so it's a multi-word encoding (like UTF-8 is a multi-byte encoding)
    and a character not necessarily encoded with only one 16 bit word... -
    ...but then I wonder even more where you see an advantage.


    When Unicode started, they thought 16 bits would be enough. UCS2 made

    They only had to check how many Chinese pictograms there are to realise
    that it was never going to be enough. Perhaps Chinese wasn't considered
    important back then.


    I am sure it was considered important - but it might have been
    considered too ambitious to include CJK languages in Unicode in the
    early days. Still, I agree they could have looked a little deeper into
    their crystal ball before deciding that 64K characters were enough for
    everyone.

  • From Lawrence D'Oliveiro@21:1/5 to Bonita Montero on Wed Apr 30 01:52:09 2025
    On Tue, 29 Apr 2025 09:41:18 +0200, Bonita Montero wrote:

    Am 29.04.2025 um 09:28 schrieb Lawrence D'Oliveiro:

    On Mon, 28 Apr 2025 18:56:56 +0200, Bonita Montero wrote:

    The data in the tar archive is locale-independent. Always.

    The filenames not if they contain characters >= 128.

    Doesn’t matter. They will still pack/unpack correctly.

    Depends on the locale of the person who sees the filenames.

    Bytes is bytes.

  • From Lawrence D'Oliveiro@21:1/5 to Muttley on Wed Apr 30 01:54:06 2025
    On Tue, 29 Apr 2025 07:40:33 -0000 (UTC), Muttley wrote:

    Case-insensitive file systems are an abortion that no sane OS should use.

    Linux at least offers the option.

  • From Lawrence D'Oliveiro@21:1/5 to Michael S on Wed Apr 30 01:53:35 2025
    On Tue, 29 Apr 2025 11:17:50 +0300, Michael S wrote:

    z/OS is alive and in good shape, but everybody knows that despite
    the trademark it is not similar to Unix.

    Just goes to show the worthlessness of the “Unix” name nowadays.

  • From Lawrence D'Oliveiro@21:1/5 to Bonita Montero on Wed Apr 30 07:12:33 2025
    On Tue, 29 Apr 2025 09:40:36 +0200, Bonita Montero wrote:

    Am 29.04.2025 um 09:25 schrieb Lawrence D'Oliveiro:

    If only ...

    Note the limitations on Windows with
    <https://docs.python.org/3/library/select.html>, just for example.

    Windows has I/O-completion ports which are more flexible than select().

    Except they have their own limitations
    <https://docs.python.org/3/library/asyncio-platforms.html>.

  • From Muttley@DastardlyHQ.org@21:1/5 to All on Wed Apr 30 07:17:43 2025
    On Wed, 30 Apr 2025 01:54:06 -0000 (UTC)
    Lawrence D'Oliveiro <ldo@nz.invalid> wibbled:
    On Tue, 29 Apr 2025 07:40:33 -0000 (UTC), Muttley wrote:

    Case-insensitive file systems are an abortion that no sane OS should use.

    Linux at least offers the option.

    AFAIK there's no case-insensitive filesystem for Linux.

  • From David Brown@21:1/5 to Muttley@DastardlyHQ.org on Wed Apr 30 09:45:20 2025
    On 30/04/2025 09:17, Muttley@DastardlyHQ.org wrote:
    On Wed, 30 Apr 2025 01:54:06 -0000 (UTC)
    Lawrence D'Oliveiro <ldo@nz.invalid> wibbled:
    On Tue, 29 Apr 2025 07:40:33 -0000 (UTC), Muttley wrote:

    Case-insensitive file systems are an abortion that no sane OS should use.

    Linux at least offers the option.

    AFAIK there's no case-insensitive filesystem for Linux.


    ext4 supports case-insensitive directories. You need a relatively
    modern kernel, an ext4 filesystem formatted with the "casefold" option
    enabled, then set the case folding extended attribute on an empty
    directory. So it's not something you'd do unintentionally, but it is
    certainly supported. The primary use-case is for Wine for Windows
    programs and games, and it might also be useful for Samba servers.

    More relevant to this group, it may also be convenient for people
    working with big C code bases that were written on Windows and now need
    to be compiled (for whatever target) on Linux.
    I've seen code bases developed on Windows machines where the
    capitalisation of include directives was inconsistent - that works on
    case-insensitive filesystems, but not on case-sensitive systems. (Yes,
    I know there are many other ways to deal with such issues, but putting
    the source code in a case-insensitive directory on ext4 is one option.)

    Linus Torvalds has just had one of his famous rants in reference to
    case-insensitive options for Bcachefs:

    <https://www.phoronix.com/news/Linus-Torvalds-Anti-Case-Fold>


    You are going to have trouble if you try to install a typical Linux
    distribution on a case-insensitive filesystem, but Linux does have at
    least some support in some filesystems.

  • From Lawrence D'Oliveiro@21:1/5 to Muttley on Wed Apr 30 07:54:58 2025
    On Tue, 29 Apr 2025 10:17:12 -0000 (UTC), Muttley wrote:

    They only had to check how many Chinese pictograms there are to realise
    that it was never going to be enough.

    Actually, the Great (and controversial) CJKV Unification was what made
    16-bit Unicode possible to begin with.

  • From Lawrence D'Oliveiro@21:1/5 to Muttley on Wed Apr 30 07:58:04 2025
    On Wed, 30 Apr 2025 07:17:43 -0000 (UTC), Muttley wrote:

    On Wed, 30 Apr 2025 01:54:06 -0000 (UTC)
    Lawrence D'Oliveiro <ldo@nz.invalid> wibbled:

    On Tue, 29 Apr 2025 07:40:33 -0000 (UTC), Muttley wrote:

    Case-insensitive file systems are an abortion that no sane OS should
    use.

    Linux at least offers the option.

    AFAIK there's no case-insensitive filesystem for Linux.

    You didn’t know? <https://lwn.net/Articles/784041/>

  • From Muttley@DastardlyHQ.org@21:1/5 to All on Wed Apr 30 09:06:18 2025
    On Wed, 30 Apr 2025 09:45:20 +0200
    David Brown <david.brown@hesbynett.no> wibbled:
    More relevant to this group, it may also be convenient for people
    trying to work with big C code bases that were written on Windows and
    you now want to compile (for whatever target you want) them on Linux.
    I've seen code bases developed on Windows machines where the
    capitalisation of include directives was inconsistent - that works on
    case-insensitive filesystems, but not on case-sensitive systems. (Yes,
    I know there are many other ways to deal with such issues, but putting
    the source code in a case-insensitive directory on ext4 is one option.)

    I've seen on more than one occasion C++ (not C yet) projects where there
    were 2 files differing only in case, e.g. Network.cpp and network.cpp,
    where the former would be the class and the latter would be procedural
    support code. Good luck unzipping that on Windows or any other
    case-insensitive file system.
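
    A check for such collisions before shipping an archive could be
    sketched as follows. The function name differs_only_in_case is
    hypothetical, and it folds ASCII letters only (assuming the "C"
    locale); real case-insensitive filesystems apply their own, usually
    Unicode-aware, folding rules.

```c
#include <ctype.h>
#include <stdbool.h>

/* True if a and b are different names that become equal once letter
 * case is ignored, e.g. "Network.cpp" and "network.cpp". */
bool differs_only_in_case(const char *a, const char *b)
{
    bool differ = false;

    for (; *a != '\0' && *b != '\0'; a++, b++) {
        if (*a == *b)
            continue;
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return false;   /* a real difference, not just case */
        differ = true;
    }
    /* Both strings must end together, and at least one position
     * must have differed in case. */
    return differ && *a == '\0' && *b == '\0';
}
```

    Running this over every pair of names in a directory flags files
    that would overwrite each other when unpacked on a case-insensitive
    file system.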

  • From Janis Papanagnou@21:1/5 to Lawrence D'Oliveiro on Wed Apr 30 11:40:17 2025
    On 30.04.2025 03:53, Lawrence D'Oliveiro wrote:
    On Tue, 29 Apr 2025 11:17:50 +0300, Michael S wrote:

    z/OS is alive and in good shape, but everybody knows that despite
    the trademark it is not similar to Unix.

    Just goes to show the worthlessness of the “Unix” name nowadays.

    "UNIX" has a meaning that varied historically. But "Unix" is
    commonly used as a name for the family of "UNIX-like" systems;
    that's very useful since it allows formulating commonalities
    of this OS family.[*]

    Janis

    [*] As we've seen in the discussion of Unix file systems with
    its basic structure of being built by sequences of octets[**]
    and having two distinguished characters '\0' and '/'.

    [**] BTW; does anyone know how e.g. the [historic] Borroughs
    Unix systems with their 9 bit/36 bit architecture had their
    file systems defined (w.r.t. the octet transfer syntax)?
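
    The rule in footnote [*] can be sketched in a few lines of C: a
    filename component is just a run of octets in which only two values
    are special. The function name is_valid_component is invented for
    this illustration, and length limits of real filesystems are ignored.

```c
#include <stdbool.h>
#include <stddef.h>

/* A name of len bytes is usable as a single path component if it is
 * non-empty and contains neither '/' nor a NUL byte. Every other byte
 * value is accepted; no character encoding is assumed. */
bool is_valid_component(const unsigned char *name, size_t len)
{
    if (len == 0)
        return false;
    for (size_t i = 0; i < len; i++) {
        if (name[i] == '/' || name[i] == '\0')
            return false;
    }
    return true;
}
```

    Under this rule a name encoded in UTF-8 and the same name encoded in
    ISO-8859-1 are both acceptable; they are simply two different names.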

  • From Janis Papanagnou@21:1/5 to Muttley@DastardlyHQ.org on Wed Apr 30 11:52:25 2025
    On 30.04.2025 11:06, Muttley@DastardlyHQ.org wrote:
    On Wed, 30 Apr 2025 09:45:20 +0200
    David Brown <david.brown@hesbynett.no> wibbled:
    More relevant to this group, it may also be convenient for people
    trying to work with big C code bases that were written on Windows and
    you now want to compile (for whatever target you want) them on Linux.
    I've seen code bases developed on Windows machines where the
    capitalisation of include directives was inconsistent - that works on
    case-insensitive filesystems, but not on case-sensitive systems. (Yes,
    I know there are many other ways to deal with such issues, but putting
    the source code in a case-insensitive directory on ext4 is one option.)

    I've seen on more than one occasion C++ (not C yet) projects where there
    were 2 files only different in case, eg: Network.cpp and network.cpp
    where the former would be the class and the latter would be procedural
    support code.
    Good luck unzipping that on Windows or any other case insensitive file
    system.

    For low-level system software like network functionality that
    would probably anyway not work on Windows in the first place
    without change, independent of the capitalization. (But the
    "case insensitive file system" issues, like the above mentioned
    case inconsistencies, are of course an inherent problem.)

    And there's of course a related problem if we port software with
    longer maximum filename lengths to systems with shorter filename
    lengths.

    Janis

  • From Janis Papanagnou@21:1/5 to David Brown on Wed Apr 30 12:38:16 2025
    On 30.04.2025 12:21, David Brown wrote:
    On 30/04/2025 11:52, Janis Papanagnou wrote:
    On 30.04.2025 11:06, Muttley@DastardlyHQ.org wrote:
    On Wed, 30 Apr 2025 09:45:20 +0200
    David Brown <david.brown@hesbynett.no> wibbled:
    More relevant to this group, it may also be convenient for people
    trying to work with big C code bases that were written on Windows and
    you now want to compile (for whatever target you want) them on Linux.
    I've seen code bases developed on Windows machines where the
    capitalisation of include directives was inconsistent - that works on
    case-insensitive filesystems, but not on case-sensitive systems. (Yes,
    I know there are many other ways to deal with such issues, but putting
    the source code in a case-insensitive directory on ext4 is one option.)

    I've seen on more than one occasion C++ (not C yet) projects where there
    were 2 files only different in case, eg: Network.cpp and network.cpp
    where the former would be the class and the latter would be procedural
    support code.
    Good luck unzipping that on Windows or any other case insensitive
    file system.

    For low-level system software like network functionality that
    would probably anyway not work on Windows in the first place
    without change, independent of the capitalization. (But the
    "case insensitive file system" issues, like the above mentioned
    case inconsistencies, are of course an inherent problem.)

    And there's of course a related problem if we port software with
    longer maximum filename lengths to systems with shorter filename
    lengths.


    What systems are there now with filename length limits that would ever
    be relevant to hand-typed names?

    Frankly, I've not the least idea what all the systems around nowadays
    do or don't support.

    Filename length limits can
    occasionally be relevant in some contexts (I've seen it in web spiders
    that try to turn complete URLs into single filenames),

    Files on the Web often have pathological filename lengths. Usually I
    have to rename them; because of convenience WRT length and also to
    give such documents a sensible semantical name.

    but unless you are trying to compile code on DOS,

    Oh, I wasn't at all thinking of DOS, more on (historic) 14 character
    limits on Unix systems.

    Nowadays we may assume larger limits (POSIX 255 characters?). (Or in
    case of 'tar' data transfers yet lower limits [for compatibility].)
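
    Rather than assuming a figure, a program can ask the system. A
    minimal sketch using POSIX pathconf(); the wrapper name
    max_filename_bytes is invented here. Note that the limit is counted
    in bytes, not characters, and that POSIX itself only guarantees a
    minimum of 14 (_POSIX_NAME_MAX), with 255 being a common value in
    practice.

```c
#include <unistd.h>

/* Longest filename, in bytes, that the filesystem holding dir accepts.
 * Returns -1 if the limit is indeterminate or the query fails. */
long max_filename_bytes(const char *dir)
{
    return pathconf(dir, _PC_NAME_MAX);
}
```

    Because the answer depends on the filesystem behind the given
    directory, the same program can see different limits for different
    paths.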

    any system will support any length of
    filename that someone would bother typing into an "#include" line.

    If there are any limits, they need to be addressed.

    Janis

  • From Janis Papanagnou@21:1/5 to David Brown on Wed Apr 30 12:15:28 2025
    On 30.04.2025 09:45, David Brown wrote:
    [...]

    Linus Torvalds has just had one of his famous rants in reference to
    case-insensitive options for Bcachefs :

    [link snipped]

    That reminds me of some statements that Arnold Robbins made about
    GNU Awk's 'IGNORECASE' feature. First he reported how much trouble
    that feature caused; it had to be considered in a lot of contexts
    and meant special handling in several places, and it was an
    extreme effort to get it to work right and consistently. And I think he
    also considered removing that feature (but it's a long existing one,
    so...). And, where it's "needed", it's usually also trivial to just
    work around it without 'IGNORECASE'.

    It makes no sense, IMO, to try to stay sort of "compatible" with
    case-insensitive file-systems.

    Janis

    [...]

  • From David Brown@21:1/5 to Muttley@DastardlyHQ.org on Wed Apr 30 12:25:34 2025
    On 30/04/2025 11:06, Muttley@DastardlyHQ.org wrote:
    On Wed, 30 Apr 2025 09:45:20 +0200
    David Brown <david.brown@hesbynett.no> wibbled:
    More relevant to this group, it may also be convenient for people
    trying to work with big C code bases that were written on Windows and
    you now want to compile (for whatever target you want) them on Linux.
    I've seen code bases developed on Windows machines where the
    capitalisation of include directives was inconsistent - that works on
    case-insensitive filesystems, but not on case-sensitive systems. (Yes,
    I know there are many other ways to deal with such issues, but putting
    the source code in a case-insensitive directory on ext4 is one option.)

    I've seen on more than one occasion C++ (not C yet) projects where there
    were 2 files only different in case, eg: Network.cpp and network.cpp
    where the former would be the class and the latter would be procedural
    support code.

    I'd question the wisdom of such a convention. I'd rather have clearer
    separation of the filenames, or perhaps use different directories,
    aiming to make it hard to mix up the names. But maybe it is an
    appropriate choice in some situations - perhaps alternative naming
    schemes were considered worse in other ways.

    Good luck unzipping that on Windows or any other case insensitive file system.



  • From David Brown@21:1/5 to Janis Papanagnou on Wed Apr 30 12:21:51 2025
    On 30/04/2025 11:52, Janis Papanagnou wrote:
    On 30.04.2025 11:06, Muttley@DastardlyHQ.org wrote:
    On Wed, 30 Apr 2025 09:45:20 +0200
    David Brown <david.brown@hesbynett.no> wibbled:
    More relevant to this group, it may also be convenient for people
    trying to work with big C code bases that were written on Windows and
    you now want to compile (for whatever target you want) them on Linux.
    I've seen code bases developed on Windows machines where the
    capitalisation of include directives was inconsistent - that works on
    case-insensitive filesystems, but not on case-sensitive systems. (Yes,
    I know there are many other ways to deal with such issues, but putting
    the source code in a case-insensitive directory on ext4 is one option.)

    I've seen on more than one occasion C++ (not C yet) projects where there
    were 2 files only different in case, eg: Network.cpp and network.cpp
    where the former would be the class and the latter would be procedural
    support code.
    Good luck unzipping that on Windows or any other case insensitive file
    system.

    For low-level system software like network functionality that
    would probably anyway not work on Windows in the first place
    without change, independent of the capitalization. (But the
    "case insensitive file system" issues, like the above mentioned
    case inconsistencies, are of course an inherent problem.)

    And there's of course a related problem if we port software with
    longer maximum filename lengths to systems with shorter filename
    lengths.


    What systems are there now with filename length limits that would ever
    be relevant to hand-typed names? Filename length limits can
    occasionally be relevant in some contexts (I've seen it in web spiders
    that try to turn complete URLs into single filenames), but unless you
    are trying to compile code on DOS, any system will support any length of
    filename that someone would bother typing into an "#include" line.

  • From Janis Papanagnou@21:1/5 to David Brown on Wed Apr 30 12:46:30 2025
    On 30.04.2025 12:25, David Brown wrote:
    On 30/04/2025 11:06, Muttley@DastardlyHQ.org wrote:

    I've seen on more than one occasion C++ (not C yet) projects where there
    were 2 files only different in case, eg: Network.cpp and network.cpp
    where
    the former would be the class and the latter would be procedural
    support code.

    I'd question the wisdom of such a convention. I'd rather have clearer
    separation of the filenames, or perhaps use different directories,
    aiming to make it hard to mix up the names. But maybe it is an
    appropriate choice in some situations - perhaps alternative naming
    schemes were considered worse in other ways.

    I recall similar situations in our C++ contexts where name suffixes
    were used (like "_util", "_aux", or similar) to disambiguate them.

    I don't think it's a good idea to use the same names and resolve
    them by organizing them through different directories only. I recall
    we had used regularly -I compiler flags and I don't want to imagine
    the hassle with same file names located in different directories.

    Janis

  • From Muttley@DastardlyHQ.org@21:1/5 to All on Wed Apr 30 12:37:45 2025
    On Wed, 30 Apr 2025 11:52:25 +0200
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wibbled:
    On 30.04.2025 11:06, Muttley@DastardlyHQ.org wrote:
    On Wed, 30 Apr 2025 09:45:20 +0200
    David Brown <david.brown@hesbynett.no> wibbled:
    More relevant to this group, it may also be convenient for people
    trying to work with big C code bases that were written on Windows and
    you now want to compile (for whatever target you want) them on Linux.
    I've seen code bases developed on Windows machines where the
    capitalisation of include directives was inconsistent - that works on
    case-insensitive filesystems, but not on case-sensitive systems. (Yes,
    I know there are many other ways to deal with such issues, but putting
    the source code in a case-insensitive directory on ext4 is one option.)

    I've seen on more than one occasion C++ (not C yet) projects where there
    were 2 files only different in case, eg: Network.cpp and network.cpp
    where the former would be the class and the latter would be procedural
    support code.
    Good luck unzipping that on Windows or any other case insensitive file
    system.

    For low-level system software like network functionality that
    would probably anyway not work on Windows in the first place

    Huh?

  • From Muttley@DastardlyHQ.org@21:1/5 to All on Wed Apr 30 12:38:53 2025
    On Wed, 30 Apr 2025 12:25:34 +0200
    David Brown <david.brown@hesbynett.no> wibbled:
    On 30/04/2025 11:06, Muttley@DastardlyHQ.org wrote:
    On Wed, 30 Apr 2025 09:45:20 +0200
    David Brown <david.brown@hesbynett.no> wibbled:
    More relevant to this group, it may also be convenient for people
    trying to work with big C code bases that were written on Windows and
    you now want to compile (for whatever target you want) them on Linux.
    I've seen code bases developed on Windows machines where the
    capitalisation of include directives was inconsistent - that works on
    case-insensitive filesystems, but not on case-sensitive systems. (Yes,
    I know there are many other ways to deal with such issues, but putting
    the source code in a case-insensitive directory on ext4 is one option.)

    I've seen on more than one occasion C++ (not C yet) projects where there
    were 2 files only different in case, eg: Network.cpp and network.cpp
    where the former would be the class and the latter would be procedural
    support code.

    I'd question the wisdom of such a convention. I'd rather have clearer
    separation of the filenames, or perhaps use different directories,
    aiming to make it hard to mix up the names. But maybe it is an
    appropriate choice in some situations - perhaps alternative naming
    schemes were considered worse in other ways.

    It's certainly not a scheme I'd use, but I've also seen Makefile and makefile
    in the same package build directory in the past.

  • From Scott Lurndal@21:1/5 to Janis Papanagnou on Wed Apr 30 13:41:31 2025
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
    On 30.04.2025 03:53, Lawrence D'Oliveiro wrote:
    On Tue, 29 Apr 2025 11:17:50 +0300, Michael S wrote:

    z/OS is alive and in good shape, but everybody knows that despite
    the trademark it is not similar to Unix.

    Just goes to show the worthlessness of the “Unix” name nowadays.

    "UNIX" has a meaning that varied historically. But "Unix" is
    commonly used as a name for the family of "UNIX-like" systems;
    that's very useful since it allows to formulate commonalities
    of this OS family.[*]

    Janis

    [*] As we've seen in the discussion of Unix file systems with
    its basic structure of being built by sequences of octets[**]
    and having two distinguished characters '\0' and '/'.

    [**] BTW; does anyone know how e.g. the [historic] Borroughs

    s/Borroughs/Burroughs/
    then
    s/Burroughs/Sperry/

    since only the Sperry systems (UNISYS = Burroughs + Sperry (1985/6))
    have a 36-bit machine. Burroughs had a variety of machines, the early
    ones basically BCD (Electrodata 220, Burroughs 300), then B5000/B5500
    (48-bit) and B3500 (4-bit BCD, 100-digit operands).

    Unix systems with their 9 bit/36 bit architecture had their
    file systems defined (w.r.t. the octet transfer syntax)?

    I was on the Burroughs side, so never got to play with the
    Univac systems.

    FWIW, the Univac systems live on as the Unisys Clearpath Dorado
    and the Burroughs B5500 descendants live on as the Unisys Clearpath
    Libra. Both are emulated on Intel CPUs (Libra on Windows Server
    and Dorado on Linux).

  • From Janis Papanagnou@21:1/5 to Scott Lurndal on Thu May 1 00:15:12 2025
    On 30.04.2025 15:41, Scott Lurndal wrote:
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
    On 30.04.2025 03:53, Lawrence D'Oliveiro wrote:
    On Tue, 29 Apr 2025 11:17:50 +0300, Michael S wrote:

    z/OS is alive and in good shape, but everybody knows that despite
    the trademark it is not similar to Unix.

    Just goes to show the worthlessness of the “Unix” name nowadays.

    "UNIX" has a meaning that varied historically. But "Unix" is
    commonly used as a name for the family of "UNIX-like" systems;
    that's very useful since it allows to formulate commonalities
    of this OS family.[*]

    [*] As we've seen in the discussion of Unix file systems with
    its basic structure of being built by sequences of octets[**]
    and having two distinguished characters '\0' and '/'.

    [**] BTW; does anyone know how e.g. the [historic] Borroughs

    s/Borroughs/Burroughs/
    then
    s/Burroughs/Sperry/

    Oh, sorry, I actually made even a more serious mistake beyond a typo;

    s/Borroughs/Honeywell 6000/

    But the question was not so much about the concrete system label but
    the question in principle of what happens if a system's character width
    is defined as 9 bit, the underlying hardware (like hard disks) is probably
    8 bit, and a Unix OS file-system sits in between.

    Janis

    [...]

  • From Lawrence D'Oliveiro@21:1/5 to Muttley on Wed Apr 30 23:56:40 2025
    On Wed, 30 Apr 2025 12:38:53 -0000 (UTC), Muttley wrote:

    It's certainly not a scheme I'd use, but I've also seen Makefile and
    makefile in the same package build directory in the past.

    The GNU “make” command, specified without a filename, looks for “GNUmakefile”, then “Makefile”, then “makefile”. The man page <https://manpages.debian.org/make(1)> says:

    We recommend Makefile because it appears prominently near the
    beginning of a directory listing, right near other important files
    such as README.

    But is this still true for most people? I think the default sort
    settings these days no longer put all-caps names at the top.

  • From Lew Pitcher@21:1/5 to Janis Papanagnou on Wed Apr 30 23:49:03 2025
    On Thu, 01 May 2025 00:15:12 +0200, Janis Papanagnou wrote:

    On 30.04.2025 15:41, Scott Lurndal wrote:
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
    On 30.04.2025 03:53, Lawrence D'Oliveiro wrote:
    On Tue, 29 Apr 2025 11:17:50 +0300, Michael S wrote:

    z/OS is alive and in good shape, but everybody knows that despite
    the trademark it is not similar to Unix.

    Just goes to show the worthlessness of the “Unix” name nowadays.

    "UNIX" has a meaning that varied historically. But "Unix" is
    commonly used as a name for the family of "UNIX-like" systems;
    that's very useful since it allows to formulate commonalities
    of this OS family.[*]

    [*] As we've seen in the discussion of Unix file systems with
    its basic structure of being built by sequences of octets[**]
    and having two distinguished characters '\0' and '/'.

    [**] BTW; does anyone know how e.g. the [historic] Borroughs

    s/Borroughs/Burroughs/
    then
    s/Burroughs/Sperry/

    Oh, sorry, I actually made even a more serious mistake beyond a typo;

    s/Borroughs/Honeywell 6000/

    But the question was not so much about the concrete system label but
    the principle question what happens if a system's character width is
    defined as 9 bit, the underlying hardware (like hard disks) probably
    8 bit,

    A quick read through the Wikipedia article on the Honeywell 6000 and
    another read through the documentation on the (related) DDS190 disk
    storage unit (see https://www.manualslib.com/manual/1939073/Honeywell-6000-Series.html?page=8#manual)
    indicates that the hard disks used 6-bit characters. That would mean
    that, on disk, you could store a Honeywell 6000 36-bit word as six
    6-bit characters (or two 9-bit program characters in three 6-bit
    storage characters).
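
    The arithmetic above (2 x 9 bits = 3 x 6 bits = 18 bits) can be
    sketched in C. This is only an illustration: the function names are
    made up, and the most-significant-first bit order is an assumption,
    not something taken from the Honeywell documentation.

```c
#include <assert.h>
#include <stdint.h>

/* Pack two 9-bit values (0..511) into three 6-bit values (0..63).
 * Bit order (most significant first) is assumed for illustration. */
void pack_9to6(const uint16_t in[2], uint8_t out[3])
{
    assert(in[0] < 512 && in[1] < 512);
    uint32_t bits = ((uint32_t)in[0] << 9) | in[1];   /* 18 bits total */
    out[0] = (uint8_t)((bits >> 12) & 0x3F);
    out[1] = (uint8_t)((bits >> 6) & 0x3F);
    out[2] = (uint8_t)(bits & 0x3F);
}

/* Inverse operation: three 6-bit values back into two 9-bit values. */
void unpack_6to9(const uint8_t in[3], uint16_t out[2])
{
    assert(in[0] < 64 && in[1] < 64 && in[2] < 64);
    uint32_t bits = ((uint32_t)in[0] << 12) | ((uint32_t)in[1] << 6) | in[2];
    out[0] = (uint16_t)((bits >> 9) & 0x1FF);
    out[1] = (uint16_t)(bits & 0x1FF);
}
```

    A full 36-bit word would simply be two such 18-bit groups, i.e. six
    6-bit storage characters.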

    and a Unix OS file-system in between.

    Janis

    [...]




    --
    Lew Pitcher
    "In Skills We Trust"

  • From vallor@21:1/5 to ldo@nz.invalid on Thu May 1 00:57:42 2025
    XPost: comp.misc

    On Wed, 30 Apr 2025 23:56:40 -0000 (UTC), Lawrence D'Oliveiro
    <ldo@nz.invalid> wrote in <vuudbo$1ajpm$7@dont-email.me>:

    On Wed, 30 Apr 2025 12:38:53 -0000 (UTC), Muttley wrote:

    It's certainly not a scheme I'd use, but I've also seen Makefile and
    makefile in the same package build directory in the past.

    The GNU “make” command, specified without a filename, looks for “GNUmakefile”, then “Makefile”, then “makefile”. The man page <https://manpages.debian.org/make(1)> says:

    We recommend Makefile because it appears prominently near the
    beginning of a directory listing, right near other important files
    such as README.

    But is this still true for most people? I think the default sort
    settings these days no longer put all-caps names at the top.

    (Setting x-post and followups to comp.misc -- follow or ignore,
    it's up to you...)

    On Linux, try:

    LC_COLLATE=C ls -l

    ...and the capitalized filenames will float to the top.

    (Discovered by logging into my Panix shell and inspecting
    the behavior and settings there.)
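
    The same effect can be reproduced from C, since the "C" locale
    collates by byte value and uppercase ASCII letters sort before
    lowercase ones. A minimal sketch (sort_names and cmp_coll are
    hypothetical helper names, not part of any library):

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator: compare two strings with the current LC_COLLATE rules. */
static int cmp_coll(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcoll(*sa, *sb);
}

/* Sort names using the collation of the given locale.
 * Returns 0 on success, -1 if the locale is not available. */
int sort_names(const char **names, size_t n, const char *locale_name)
{
    assert(names != NULL && locale_name != NULL);
    if (setlocale(LC_COLLATE, locale_name) == NULL)
        return -1;
    qsort(names, n, sizeof *names, cmp_coll);
    return 0;
}
```

    Calling sort_names(names, n, "C") should put "Makefile" and "README"
    ahead of lowercase names; with a locale such as "en_US.UTF-8" (if it
    is installed) the order typically interleaves upper and lower case.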

    --
    -v System76 Thelio Mega v1.1 x86_64 NVIDIA RTX 3090 Ti
    OS: Linux 6.14.4 Release: Mint 22.1 Mem: 258G

  • From David Brown@21:1/5 to Lawrence D'Oliveiro on Thu May 1 11:13:15 2025
    On 01/05/2025 01:56, Lawrence D'Oliveiro wrote:
    On Wed, 30 Apr 2025 12:38:53 -0000 (UTC), Muttley wrote:

    It's certainly not a scheme I'd use, but I've also seen Makefile and
    makefile in the same package build directory in the past.

    The GNU “make” command, specified without a filename, looks for “GNUmakefile”, then “Makefile”, then “makefile”. The man page <https://manpages.debian.org/make(1)> says:

    We recommend Makefile because it appears prominently near the
    beginning of a directory listing, right near other important files
    such as README.

    But is this still true for most people? I think the default sort
    settings these days no longer put all-caps names at the top.

    I can't speak for "most people", but since my project directories rarely
    have more than about a dozen files and directories (like "src" and
    "build") in the top directory, it could be called zzzz and still be near
    the top!

  • From Janis Papanagnou@21:1/5 to Lawrence D'Oliveiro on Fri May 2 09:52:51 2025
    On 01.05.2025 01:56, Lawrence D'Oliveiro wrote:
    On Wed, 30 Apr 2025 12:38:53 -0000 (UTC), Muttley wrote:

    It's certainly not a scheme I'd use, but I've also seen Makefile and
    makefile in the same package build directory in the past.

    The GNU “make” command, specified without a filename, looks for “GNUmakefile”, then “Makefile”, then “makefile”. The man page <https://manpages.debian.org/make(1)> says:

    We recommend Makefile because it appears prominently near the
    beginning of a directory listing, right near other important files
    such as README.

    But is this still true for most people? I think the default sort
    settings these days no longer put all-caps names at the top.

    I cannot tell for "most people", but it certainly depends on the
    collating order of the locale setting. For example I have a mixed
    setting; my default may be something like "de_DE.UTF-8"[*] but
    some details I've changed to "en_US.utf8", and some to "C.UTF-8";
    the sorting order is defined by (classical) "LC_COLLATE=C.UTF-8".

    Despite using the classical collating order, where "Makefile" stands
    out, I have the habit of using (for my own file organization) a leading
    underscore to have specific files visually separated in listings.

    Janis

    [*] Where umlauts are sorted with their base characters (Ä with
    A etc.) and comparisons are done case-insensitively, so that "Makefile"
    is an ordinary name in between the other file names. (As you say.)

  • From Janis Papanagnou@21:1/5 to Lew Pitcher on Fri May 2 10:25:23 2025
    On 01.05.2025 01:49, Lew Pitcher wrote:
    On Thu, 01 May 2025 00:15:12 +0200, Janis Papanagnou wrote:

    On 30.04.2025 15:41, Scott Lurndal wrote:
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
    On 30.04.2025 03:53, Lawrence D'Oliveiro wrote:
    On Tue, 29 Apr 2025 11:17:50 +0300, Michael S wrote:

    z/OS is alive and in good shape, but everybody knows that despite
    the trademark it is not similar to Unix.

    Just goes to show the worthlessness of the “Unix” name nowadays.

    "UNIX" has a meaning that varied historically. But "Unix" is
    commonly used as a name for the family of "UNIX-like" systems;
    that's very useful since it allows to formulate commonalities
    of this OS family.[*]

    [*] As we've seen in the discussion of Unix file systems with
    its basic structure of being built by sequences of octets[**]
    and having two distinguished characters '\0' and '/'.

    [**] BTW; does anyone know how e.g. the [historic] Borroughs

    s/Borroughs/Burroughs/
    then
    s/Burroughs/Sperry/

    Oh, sorry, I actually made even a more serious mistake beyond a typo;

    s/Borroughs/Honeywell 6000/

    But the question was not so much about the concrete system label but
    the principle question what happens if a system's character width is
    defined as 9 bit, the underlying hardware (like hard disks) probably
    8 bit,

    A quick read through the Wikipedia article on the Honeywell 6000 and
    another read through the documentation on the (related) DDS190 disk
    storage unit (see https://www.manualslib.com/manual/1939073/Honeywell-6000-Series.html?page=8#manual)
    indicates that the hard disks used 6-bit characters.

    Thanks.

    I haven't worked on a Honeywell 6000 myself (just found it listed
    in an old book about Unix). I know that 6 bit characters were quite
    common back in those days (supported by the 60 bit CDC 175 and the
    48 bit TR440, for example).

    I'm so used to common "contemporary" (late 1970's and later) discs
    that it didn't occur to me that _hard disc_ hardware could be 6 bit
    based.

    That would mean
    that, on disk, you could store a Honeywell 6000 36-bit word as six
    6-bit characters (or two 9-bit program characters in three 6-bit
    storage characters).

    The book I was referring to mentions a 9 bit "C" 'char' type...

    and a Unix OS file-system in between.

    So the question remains how that's mapped onto the Unix file system
    running on the Honeywell 6000 systems. If the OS is written in "C"
    then I'd assume all characters would be 9 bit. The documentation for
    the file system is non-specifically speaking about "bytes" - I assume
    "9 bit bytes" - but I don't see other information about characters
    in the directory. I'd assume it was all 9 bit oriented, with the basic
    convention of the characters '\0' and '/' as the only "limitation"
    of the file system intact?

    Janis

  • From Tim Rentsch@21:1/5 to Scott Lurndal on Sun May 4 21:47:13 2025
    scott@slp53.sl.home (Scott Lurndal) writes:

    Bonita Montero <Bonita.Montero@gmail.com> writes:

    Am 29.04.2025 um 02:28 schrieb Scott Lurndal:

    Bonita Montero <Bonita.Montero@gmail.com> writes:

    Am 28.04.2025 um 20:47 schrieb Richard Harnden:

    On 28/04/2025 19:36, Bonita Montero wrote:

    Am 28.04.2025 um 18:59 schrieb Scott Lurndal:

    Not really. UTF-8 is UTF-8, regardless of the locale.

    But UTF-8 isn't the standard locale for Unix filesystems
    except with macOS.

    UTF-8 isn't a locale - it's an encoding.

    Idiot.
    Type "locale" in the shell and then hit Return.

    $ locale
    LANG=C
    LC_CTYPE="C"
    LC_NUMERIC=C
    LC_TIME="C"
    LC_COLLATE="C"
    LC_MONETARY="C"
    LC_MESSAGES="C"
    LC_PAPER="C"
    LC_NAME="C"
    LC_ADDRESS="C"
    LC_TELEPHONE="C"
    LC_MEASUREMENT="C"
    LC_IDENTIFICATION="C"
    LC_ALL=

    For me:

    boni@Raubtier-Asyl:/mnt/c/Users/Boni$ locale
    LANG=C.UTF-8
    LANGUAGE=

    Same locale, different encoding. As has been pointed out
    to you repeatedly.

    I don't want to take sides in this exchange, but I feel obliged to
    point out that, in the terminology of the ISO C standard, having a
    different encoding implies being a different locale. So it might be
    good to specify a frame of reference for where the term "locale" has
    its source, for the various respective comments being offered.

  • From Janis Papanagnou@21:1/5 to BGB on Wed May 7 01:19:41 2025
    On 06.05.2025 20:01, BGB wrote:
    [...]

    The partial rationale here being that the directory entries in this case
    were fixed size (like FAT, albeit with longer names), and this could potentially make the difference between using a single directory entry
    or needing a more complex LFN style scheme. Though, in this case, the
    default name length is 48, and it is rare for a filename to not fit into
    48 bytes.

    You mean rare in your application areas?

    This appears to me like a very conservative size. While I'd agree
    that it's probably a sensible value for own files with explicitly
    chosen file names a lot of files that are downloaded regularly do
    have longer file names. A quick check of my "Documents" directory
    (that contains both, downloaded files and own files) shows a ratio
    of 1563:629, i.e. roughly 30% of the files of "document" type have
    lengths > 48 (there are no files with a file name length > 128).

    I recall someone here recently spoke about chosen lengths of 255
    (or some such) for file names, which seems to be plenty, OTOH.

    Janis

    [...]

  • From Janis Papanagnou@21:1/5 to BGB on Wed May 7 14:58:50 2025
    On 07.05.2025 12:08, BGB wrote:
    [...]

    Though, if someone really must make something case-insensitive, a case
    could be made for only supporting it for maybe Latin, Greek, and
    Cyrillic.

    I don't understand what you want to say here; it just sounds strange
    to me. - Mind to elaborate?

    Ideally, this would be better handled in a file-browser or
    similar, and not in the VFS or FS driver itself.

    Janis

  • From Scott Lurndal@21:1/5 to BGB on Wed May 7 13:45:35 2025
    BGB <cr88192@gmail.com> writes:
    On 5/6/2025 6:19 PM, Janis Papanagnou wrote:
    On 06.05.2025 20:01, BGB wrote:
    [...]

    The partial rationale here being that the directory entries in this case
    were fixed size (like FAT, albeit with longer names), and this could
    potentially make the difference between using a single directory entry
    or needing a more complex LFN style scheme. Though, in this case, the
    default name length is 48, and it is rare for a filename to not fit into
    48 bytes.

    You mean rare in your application areas?

    This appears to me like a very conservative size. While I'd agree
    that it's probably a sensible value for own files with explicitly
    chosen file names a lot of files that are downloaded regularly do
    have longer file names. A quick check of my "Documents" directory
    (that contains both, downloaded files and own files) shows a ratio
    of 1563:629, i.e. roughly about 30% files of "document" type with
    lengths > 48 (there's no files with a file name length > 128).

    I recall someone here recently spoke about chosen lengths of 255
    (or some such) for file names, which seems to be plenty, OTOH.



    Running quick/dirty stats of everything on my "K:" drive, roughly 2
    million files of various assorted types.

    Stats (file names less than N bytes):
    16: 66.40%
    24: 87.85%
    32: 95.38%
    48: 99.31%

    Are you considering the entire path, or just the final component?

    Full paths will exceed the 48 bytes frequently, this, for example
    is 142 bytes.

    /work/music/Blues/Howlin' Wolf/The Chess Box (1963 - 1973) (disc 3)/20 - The Red Rooster (London Sessions w false start and dialog) (1970).wav

  • From Michael S@21:1/5 to Janis Papanagnou on Wed May 7 20:24:30 2025
    On Wed, 7 May 2025 14:58:50 +0200
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

    On 07.05.2025 12:08, BGB wrote:
    [...]

    Though, if someone really must make something case-insensitive, a
    case could be made for only supporting it for maybe Latin, Greek,
    and Cyrillic.

    I don't understand what you want to say here; it just sounds strange
    to me. - Mind to elaborate?


    Latin, Greek and Cyrillic are three bicameral scripts that
    - have a long history
    - are widely used today
    - have simple, mostly unambiguous relationships between upper and lower
    case.
    AFAIK, the only other script that shares all of these properties is
    Armenian. I would think that BGB just forgot about it and that he
    would agree that Armenian belongs with the other three.

    Support for case-insensitivity for the rest of the bicameral scripts in
    Unicode would be harder, either because of the immaturity of some of the
    scripts or because nobody uses them any longer, so it would be hard to
    find an authority in case of confusion.

    Ideally, this would be better handled in a file-browser or
    similar, and not in the VFS or FS driver itself.

    Janis


  • From Scott Lurndal@21:1/5 to BGB on Wed May 7 18:57:58 2025
    BGB <cr88192@gmail.com> writes:
    On 5/7/2025 8:45 AM, Scott Lurndal wrote:
    BGB <cr88192@gmail.com> writes:
    On 5/6/2025 6:19 PM, Janis Papanagnou wrote:
    On 06.05.2025 20:01, BGB wrote:
    [...]

    These component names are what the filesystem actually stores.


    Combined path length can be somewhat longer.
    A traditional limit is 260 chars though.

    There wasn't a well defined path-length limit in my projects, though
    informally typically somewhere between 256 and 768 bytes.

    It is rare to see a path over 256, but, "if doing it properly" a
    consistent length limit of, say, 512 or 768 would make sense. Going any
    bigger is likely needlessly overkill.


    Full paths will exceed the 48 bytes frequently, this, for example
    is 142 bytes.

    /work/music/Blues/Howlin' Wolf/The Chess Box (1963 - 1973) (disc 3)/20 - The Red Rooster (London Sessions w false start and dialog) (1970).wav

    We don't usually store full paths in a filesystem, as each directory
    exists as its own entity.

    So, say, if broken down:
    5 chars ("work")
    6 chars ("music")
    6 chars
    13 chars
    37 chars
    75 chars.

    Longest name counted here would be 75.
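
    The per-component breakdown can be checked mechanically. A small
    sketch in C (longest_component is a hypothetical helper); note that
    it counts the bytes of each name only, whereas the figures quoted
    above appear to include one separator per name, so it reports one
    less (74 rather than 75 for the example path).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the length in bytes of the longest '/'-separated component
 * of path, not counting the separators themselves. */
size_t longest_component(const char *path)
{
    assert(path != NULL);
    size_t longest = 0;
    while (*path != '\0') {
        size_t len = strcspn(path, "/");   /* bytes up to next '/' or end */
        if (len > longest)
            longest = len;
        path += len;
        if (*path == '/')
            path++;                        /* skip the separator */
    }
    return longest;
}
```

    It is this per-component length, not the full path length, that a
    fixed-size directory entry has to accommodate.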

    It doesn't matter from the application perspective, an
    application needs to be written to support the largest
    path length an implementation allows (e.g. POSIX PATH_MAX).

    Very few applications actually walk directories and thus
    don't particularly notice per-directory limits.

    POSIX has NAME_MAX as the maximum size that an implementation
    supports (POSIX also defines the minimum value that an
    implementation must support, which is 14 for NAME_MAX).

    14 bytes was the maximum filename length for the original unix
    filesystems (later known as S5).

    $ getconf -a | grep NAME_MAX
    NAME_MAX 255
    _POSIX_NAME_MAX 255

    $ getconf -a | grep PATH_MAX
    PATH_MAX 4096
    _POSIX_PATH_MAX 4096
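
    From a C program the same limits can be queried per directory with
    POSIX pathconf(), since they may differ between mounted file systems.
    A minimal sketch (dir_limit is a hypothetical wrapper name):

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Query a per-directory limit such as _PC_NAME_MAX or _PC_PATH_MAX.
 * Returns the limit; returns -1 with errno left at 0 if the limit is
 * indeterminate, or -1 with errno set on error. */
long dir_limit(const char *dir, int name)
{
    assert(dir != NULL);
    errno = 0;                 /* lets the caller tell "no limit" from error */
    return pathconf(dir, name);
}
```

    For example, dir_limit(".", _PC_NAME_MAX) would be expected to give
    255 on the system shown above, and dir_limit("/", _PC_PATH_MAX) 4096,
    but the values depend on the file system in question.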



    It is not all that uncommon for directories to get larger than a limit
    where linear search is inefficient.

    Large directories are not generally considered a good
    idea for multiple reasons.

  • From David Brown@21:1/5 to BGB on Wed May 7 23:03:44 2025
    On 07/05/2025 20:26, BGB wrote:
    On 5/7/2025 7:58 AM, Janis Papanagnou wrote:
    On 07.05.2025 12:08, BGB wrote:
    [...]

    Though, if someone really must make something case-insensitive, a case
    could be made for only supporting it for maybe Latin, Greek, and
    Cyrillic.

    I don't understand what you want to say here; it just sounds strange
    to me. - Mind to elaborate?


    Latin, Greek, and Cyrillic, are the main alphabets which actually have a useful and reasonably well defined concept of "case", and thus "case
    folding" actually makes sense for these.

    For most other places, it does not, and one can likely ignore rules for things outside of these alphabets. Can eliminate a bunch of rules for alphabets that don't actually have "case" as we would understand it.


    By limiting rules in these ways, a simpler and more manageable set of
    rules is possible. Vs, say, actual Unicode rules, which tend to have
    stuff going on all over the place.


    Ligatures pose an issue though, but presumably the options are:
      Case fold between ligatures, when both variants exist;
      Treat the ligature as its own character;
      Decompose and compare.


    Though, FWIW, in my normalization code, I mostly ignored ligatures, as
    while they could be decomposed in many cases, they could only be
    recomposed for locales that actually use said ligature (like, in
    English, if AE and IJ started spontaneously merging into new characters,
    this would be weird and out of place; and having a filesystem layer that merely decomposed any ligatures it encountered would not be ideal).


    Ideally, this would be better handled in a file-browser or
    similar, and not in the VFS or FS driver itself.

    Janis



    No matter how you choose to do it, you will get it wrong sometimes. Case-insensitive comparison has language-specific details in addition to
    the character in the Unicode tables. Should the lower-case version of
    "SS" be "ss" or "ß" ? That depends on the language and the position of
    the letters. Should the capital of "ß" be "SS" or "ẞ"? Should the
    capital of "i" be "I" or "İ" ? Some languages have a letter "dz" - some
    of those capitalise it as "DZ", others as "Dz".

    About the only case-normalisation you can reasonably do without risk of
    getting things wrong (except for the Turkish i/ı) is for the plain 26
    letters in ASCII. For everything else you would provide little help
    to anyone, and make mistakes for some languages. Case normalisation, like
    ordering, is language-dependent and does not belong in a filesystem or
    other low-level parts of a system.

  • From Lawrence D'Oliveiro@21:1/5 to BGB on Thu May 8 01:08:36 2025
    On Wed, 7 May 2025 13:26:49 -0500, BGB wrote:

    Ligatures pose an issue though ...

    Ligatures are a rendering issue. Leave them out of the text encoding as
    far as possible.

  • From Lawrence D'Oliveiro@21:1/5 to BGB on Thu May 8 01:07:54 2025
    On Wed, 7 May 2025 05:08:03 -0500, BGB wrote:

    Ideally, filesystems should be case sensitive by default;
    If someone wants case insensitivity, this can be better handled at the application or file-browser level.

    Even Linux has given in on this. The widely-used ext4 filesystem has an
    option for case-insensitivity, which, once enabled for a volume, can be activated on a per-directory basis.

  • From Janis Papanagnou@21:1/5 to BGB on Thu May 8 13:13:08 2025
    On 08.05.2025 05:30, BGB wrote:
    [...]

    Though, even for the Latin alphabet, once one goes much outside of ASCII
    and Latin-1, it gets messy.

    I noticed that in several places you were referring to Latin-1. That
    was replaced decades ago by the Latin-9 (ISO 8859-15) character
    set[*] for practical reasons ('€' sign, for example).

    Why is your focus still on the old Latin-1 (ISO 8859-1) character set?

    Janis, just curious

    [*] Unless Unicode and its encodings are used.

    [...]

  • From Janis Papanagnou@21:1/5 to David Brown on Thu May 8 12:52:36 2025
    On 07.05.2025 23:03, David Brown wrote:
    [...]

    No matter how you choose to do it, you will get it wrong sometimes. Case-insensitive comparison has language-specific details in addition to
    the character in the Unicode tables. Should the lower-case version of
    "SS" be "ss" or "ß" ? That depends on the language and the position of
    the letters. Should the capital of "ß" be "SS" or "ẞ"? [...]

    Concerning file system's named directory entries - and I'm speaking
    for German here - an "ß" is not the same as an "ss" or "sz" ligature.
    An "ss" for example may either be part of an ordinary word or just a "replacement representation" (I'm lacking the correct English term)
    for the 'Sharp S'. So the letters shall not be substituted by other
    letters (or letter sequences) but always taken literally. So an "SS"
    shall always be "ss". Historically the only relevant problem was the
    'Lower Case Sharp S' where there had been no 'Upper Case Sharp S',
    but that situation has changed (not too long ago).

    The main point of issues with "flattening" cases in case-insensitive (file-)systems is still undisputed [by me] of course.

    Janis

    [...]

  • From Lawrence D'Oliveiro@21:1/5 to BGB on Fri May 9 02:26:15 2025
    On Thu, 8 May 2025 18:50:33 -0500, BGB wrote:

    But, I don't bother with C1 control codes, as they are unused ...

    Mostly true. But I think terminal emulators do interpret CSI as equivalent
    to ESC followed by “[”.

    In some contexts, may or may not also have ANSI escape sequences, though generally no text editors deal with or make use of ANSI escapes.

    Editors (and other apps) running in “full-screen” mode within a terminal emulator would use them to control the display.

  • From Lawrence D'Oliveiro@21:1/5 to Michael S on Fri May 9 02:24:09 2025
    On Wed, 7 May 2025 20:24:30 +0300, Michael S wrote:

    AFAIK, the only other script that shares all of these properties is
    Armenian.

    Isn’t it true, though, that uppercase Armenian script is only of
    historical significance these days, and that all their text is normally
    written using only the lowercase letters?

  • From Lawrence D'Oliveiro@21:1/5 to BGB on Fri May 9 02:22:59 2025
    On Thu, 8 May 2025 01:57:05 -0500, BGB wrote:

    Either way, case-insensitivity at the FS level adds complexity.

    If you look around some other groups, you will see discussion of a recent
    rant from Linus Torvalds on this very issue. Basically, he doesn’t like case-insensitivity. And he is justified in pointing out that it leads to
    more opportunities for bugs in the kernel code. The only reason we need to
    have it is because it makes certain things easier for users.

    I guess, one intermediate option could be to keep the FS proper as case sensitive, but then fake case insensitivity at the level of the OS APIs (based on a system-level locale setting).

    There is a standard Unicode locale-independent case-folding algorithm.
    That is what Linux implements. At the time of volume initialization, it
    only involves setting one filesystem parameter, which says to assume that
    all filenames are UTF-8-encoded.

  • From Janis Papanagnou@21:1/5 to Keith Thompson on Fri May 9 11:45:44 2025
    On 09.05.2025 02:19, Keith Thompson wrote:
    BGB <cr88192@gmail.com> writes:
    [...]
    [...]

    The Latin-1 8-bit character set is largely obsolete. Whatever point
    you're making, I suspect you could make it much more clearly without
    any reference to Latin-1 or Windows-1252.

    Indeed.

    [...]
    [...]

    It is 8-bit and byte-based, and informally I think, most
    extended-ASCII codepages were collectively known as ASCII even if only
    the low 7-bit range is ASCII proper (and I think more for sake of
    contrast with "Not Unicode", eg, UTF-8 / UTF-16 / UCS-2 / ...).

    No, 8-bit character sets are not ASCII. Calling them "extended ASCII"
    is reasonable.

    Back when the ASCII character set got extended it may have been
    sensible (for a short period of time!) to use an informal term like
    "extended ASCII", but with the many different extensions it's IMO not
    reasonable any more to do so, since it causes more confusion than
    it clarifies.

    Also "extended ASCII" sounds (in my ears) like a determined character
    set (as opposed to "_an_ ASCII extension"). Unicode is [roughly] also
    an ASCII extension.

    Janis

  • From Janis Papanagnou@21:1/5 to Keith Thompson on Fri May 9 12:11:31 2025
    On 09.05.2025 04:56, Keith Thompson wrote:
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
    In a "C" file (of the Kornshell software) I stumbled across this
    comment: "Each command in the history file starts on an even byte
    and is null-terminated."

    I wonder what's the reason behind that even-byte-alignment, on "C"
    level or on Unix/files level. Any ideas?

    This question led to a number of long digressions, most of which didn't address the original question.

    Indeed. (And also not surprising.) But the earlier on-topic answers
    and thoughts already satisfied my curiosity. And the (not unusual)
    off-topic digressions were not uninteresting [to me].


    The quoted comment is in src/cmd/ksh93/edit/history.c in <https://github.com/ksh93/ksh>. It goes on to mention versions 0
    and 1 of the history file format.

    I haven't been able to find sources for ksh that would shed any light on this.

    The even byte requirement in version 1 was likely inherited from version
    0. The initial commit in the git repo includes release notes going back
    to 1987, but no old versions of the source code.

    My best guess is that the author of some early version of ksh, when
    first defining the Version 0 history file format, just thought that even
    byte alignment was a good idea at the time. There might not be any
    deeper reason than that.

    Yeah, probably. (One suggestion was about support of extensions for
    16 bit character sets, IIRC.)

    With the inspection of the source code I've a good feeling that the
    syntax won't reveal any unexpected annoyances or surprises; which was
    my primary concern.
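
    For illustration only, the rule from the quoted comment ("starts on
    an even byte and is null-terminated") can be met by padding each
    record with a second '\0' when its length is odd. The sketch below
    is NOT the ksh code and does not reproduce the real history file
    layout (which has more to it); append_record is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Append cmd plus its '\0' terminator to buf at offset *pos, which must
 * be even. If the record length is odd, add one more '\0' so that the
 * next record again starts on an even offset.
 * Returns 0 on success, -1 if the record does not fit. */
int append_record(unsigned char *buf, size_t cap, size_t *pos, const char *cmd)
{
    assert(buf != NULL && pos != NULL && cmd != NULL);
    assert(*pos % 2 == 0);
    size_t len = strlen(cmd) + 1;        /* command plus terminator */
    size_t padded = len + (len % 2);     /* round up to an even length */
    if (*pos > cap || padded > cap - *pos)
        return -1;
    memcpy(buf + *pos, cmd, len);
    if (padded != len)
        buf[*pos + len] = '\0';          /* alignment padding */
    *pos += padded;
    return 0;
}
```

    A reader of such a format can then rely on every record offset being
    even, whatever the original motivation for that was.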

    Janis

  • From Janis Papanagnou@21:1/5 to Lawrence D'Oliveiro on Fri May 9 11:59:50 2025
    On 09.05.2025 04:22, Lawrence D'Oliveiro wrote:
    On Thu, 8 May 2025 01:57:05 -0500, BGB wrote:

    Either way, case-insensitivity at the FS level adds complexity.

    If you look around some other groups, you will see discussion of a recent rant from Linus Torvalds on this very issue. Basically, he doesn’t like case-insensitivity. And he is justified in pointing out that it leads to
    more opportunities for bugs in the kernel code.

    (A similar comment exists, I think from Arnold Robbins, with Awk's
    IGNORECASE feature.)

    The only reason we need to
    have it is because it makes certain things easier for users.

    Well, this may be true - and I suppose it is. But we should not give
    the impression that it's a "pro" for some users and there's no "con"
    [for all users] with such interfaces and system behaviors. Regularly
    I'm cussing, for example, if I'm searching for phrases on interfaces
    that neither support case-sensitivity nor regular expressions.

    But the biggest problem is probably when interoperability comes into
    play.

    Janis

  • From Scott Lurndal@21:1/5 to Lawrence D'Oliveiro on Fri May 9 14:09:01 2025
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    On Thu, 8 May 2025 01:57:05 -0500, BGB wrote:

    Either way, case-insensitivity at the FS level adds complexity.

    If you look around some other groups, you will see discussion of a
    recent rant from Linus Torvalds on this very issue. Basically, he
    doesn’t like case-insensitivity. And he is justified in pointing out
    that it leads to more opportunities for bugs in the kernel code.

    And potential security issues.

  • From Scott Lurndal@21:1/5 to Keith Thompson on Fri May 9 14:13:57 2025
    Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
    In a "C" file (of the Kornshell software) I stumbled across this
    comment: "Each command in the history file starts on an even byte
    and is null-terminated."

    I wonder what's the reason behind that even-byte-alignment, on "C"
    level or on Unix/files level. Any ideas?

    Janis

    Note: Since it's not a shell question but more of a programming or
    platform related question I try to get the answer here (and not in
    comp.unix.shell); just saying to prevent distracting calls to order.

    This question led to a number of long digressions, most of which didn't
    address the original question.

    The quoted comment is in src/cmd/ksh93/edit/history.c in
    <https://github.com/ksh93/ksh>. It goes on to mention versions 0
    and 1 of the history file format.

    I haven't been able to find sources for ksh that would shed any light on
    this.

    The even byte requirement in version 1 was likely inherited from version
    0. The initial commit in the git repo includes release notes going back
    to 1987, but no old versions of the source code.

    My best guess is that the author of some early version of ksh, when
    first defining the Version 0 history file format, just thought that even
    byte alignment was a good idea at the time. There might not be any
    deeper reason than that.

    Which would have been David Korn (now 81), of course. The earliest sources
    I have at hand are for SVR4.2 ES/MP from approximately 1993.

    The primary difference between version 0 and version 1 was support
    for 8-bit characters.

    /*
    * Each command in the history file starts on an even byte is null terminated.
    * The first byte must contain the special character H_UNDO and the second
    * byte is the version number. The sequence H_UNDO 0, following a command,
    * nullifies the previous command. A six byte sequence starting with
    * H_CMDNO is used to store the command number so that it is not necessary
    * to read the file from beginning to end to get to the last block of
    * commands. This format of this sequence is different in version 1
    * then in version 0. Version 1 allows commands to use the full 8 bit
    * character set. It can understand version 0 format files.
    */
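
    As an illustration only (this is not taken from the ksh sources), the
    "starts on an even byte and is null terminated" rule from the comment
    above could be sketched roughly as follows. The helper name hist_append
    and the in-memory buffer are assumptions made up for this example; the
    real code writes to a file and also handles the H_UNDO and H_CMDNO
    sequences, which are left out here.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not from ksh: copy one command, including its
 * terminating NUL, into buf at offset off. If the command ends on an odd
 * offset, add one extra NUL so that the next command starts on an even
 * byte. Returns the offset for the next command, or 0 if it does not fit.
 */
static size_t hist_append(unsigned char *buf, size_t cap, size_t off,
                          const char *cmd)
{
    size_t len = strlen(cmd) + 1;      /* command plus terminating NUL */
    size_t end = off + len;            /* first byte after the NUL */
    size_t padded = end + (end & 1);   /* round up to an even offset */

    if (padded > cap)
        return 0;
    memcpy(buf + off, cmd, len);
    if (padded != end)
        buf[end] = '\0';               /* padding byte */
    return padded;
}
```

    Assuming the two-byte header (H_UNDO, version) occupies offsets 0 and 1,
    appending "ls" at offset 2 takes three bytes plus one padding byte, so
    the next command starts at offset 6; appending "pwd" there takes exactly
    four bytes and needs no padding, so the next offset is 10.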

  • From Lawrence D'Oliveiro@21:1/5 to Bonita Montero on Wed May 14 07:07:03 2025
    On Fri, 9 May 2025 19:52:45 +0200, Bonita Montero wrote:

    Unicode hasn't locales ...

    You seem to have a serious misunderstanding of both Unicode and the
    concept of locales.

  • From Lawrence D'Oliveiro@21:1/5 to BGB on Thu May 15 07:33:07 2025
    On Fri, 9 May 2025 12:50:10 -0500, BGB wrote:

    On 5/8/2025 9:26 PM, Lawrence D'Oliveiro wrote:

    On Thu, 8 May 2025 18:50:33 -0500, BGB wrote:

    But, I don't bother with C1 control codes, as they are unused ...

    Mostly true. But I think terminal emulators do interpret CSI as
    equivalent to ESC followed by “[”.

    Possibly, though generally, ESC+[ is used IME.

    Actually, several other C1 controls are also defined as equivalents to
    sequences beginning with ESC.

    Also creates uncertainty, as AFAIK the terminals traditionally operate
    on raw bytes regarding ANSI commands, whereas if the terminal interface
    is UTF-8, a CSI (as a 2-byte encoding) would not be equivalent to 0x9B
    (if encoded as a single byte).

    Yeah, I just checked KDE Konsole, and it doesn’t interpret 0x9B (CSI) as
    equivalent to 0x1B followed by “[”.

    I suppose I should check if changing the encoding makes any difference to
    this ...
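
    To make the byte-level point concrete: a small sketch (the function name
    utf8_encode_small is made up for this example, and it only covers code
    points below U+0800) showing that CSI as the code point U+009B becomes
    the two bytes 0xC2 0x9B in UTF-8, which is neither the single byte 0x9B
    nor the two-byte sequence 0x1B 0x5B (ESC followed by “[”).

```c
#include <stddef.h>

/*
 * Encode a code point below U+0800 as UTF-8 into out.
 * Returns the number of bytes written (1 or 2), or 0 if cp is outside
 * the range this sketch handles.
 */
static size_t utf8_encode_small(unsigned cp, unsigned char out[2])
{
    if (cp < 0x80) {                   /* ASCII: one byte, unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                  /* two bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    return 0;
}
```

    So a terminal that decodes its input as UTF-8 would have to map the
    decoded code point U+009B back to CSI, rather than look for a raw 0x9B
    byte in the stream.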

    I was thinking here more of a GUI based editor or pseudo-word processor;
    where Text + ANSI codes could, in theory, serve a similar role to the
    RTF format, although more as extended text rather than a sort of markup
    language (though, modern word processors typically use XML internally,
    as opposed to the more unusual markup scheme that RTF had used).

    There’s an old thing called “sixel graphics”, which DEC invented back in
    the day. I found out KDE Konsole supports it! I think some other terminal
    emulators do, too. There is a libsixel library that allows converting
    image formats. You only get 256 colours maximum, but that is still
    potentially quite useful.

    Sometimes, it would also be nice if there was a sort of a standalone
    graphical viewer/editor that used MediaWiki or Markdown or AsciiDoc or
    similar.

    pandoc -f markdown -t pdf «infile» | okular - &
