In a "C" file (of the Kornshell software) I stumbled across this
comment: "Each command in the history file starts on an even byte
and is null-terminated."
I wonder what's the reason behind that even-byte-alignment, on "C"
level or on Unix/files level. Any ideas?
scott@slp53.sl.home (Scott Lurndal) writes:
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
In a "C" file (of the Kornshell software) I stumbled across this
comment: "Each command in the history file starts on an even byte
and is null-terminated."
I wonder what's the reason behind that even-byte-alignment, on "C"
level or on Unix/files level. Any ideas?
Possibly to support 16-bit character sets?
I don't think it supports 16-bit character sets.
Unlike bash history files, which are plain text, ksh history files
are in a binary format.
I don't know whether the format includes any multi-byte integers.
If it does, reading such values directly into memory might be easier
on some platforms if they're aligned.
The relevant source file is src/cmd/ksh93/edit/history.c, in
<https://github.com/ksh93/ksh>. It has functions to manipulate the
history file, but I don't see a full description of the file format.
In a "C" file (of the Kornshell software) I stumbled across this
comment: "Each command in the history file starts on an even byte
and is null-terminated."
On 2025-04-26, Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
In a "C" file (of the Kornshell software) I stumbled across this[...]
comment: "Each command in the history file starts on an even byte
and is null-terminated."
The alignment could be of help if you're looking at the file
with "od -tx2a".
On 2025-04-26 at 17:00, Scott Lurndal wrote:
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
In a "C" file (of the Kornshell software) I stumbled across this
comment: "Each command in the history file starts on an even byte
and is null-terminated."
I wonder what's the reason behind that even-byte-alignment, on "C"
level or on Unix/files level. Any ideas?
Possibly to support 16-bit character sets?
Unix has a big problem in that it doesn't support 16-bit character sets.
Win32 supported UCS-2 from the beginning and UTF-16, afaik, since Windows
2000.
With Unix there isn't even a standard charset for the filesystem;
each filename character is just an octet.
Bonita Montero <Bonita.Montero@gmail.com> writes:
On 2025-04-26 at 17:00, Scott Lurndal wrote:
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
In a "C" file (of the Kornshell software) I stumbled across this
comment: "Each command in the history file starts on an even byte
and is null-terminated."
I wonder what's the reason behind that even-byte-alignment, on "C"
level or on Unix/files level. Any ideas?
Possibly to support 16-bit character sets?
Unix has a big problem that it doesn't support 16 bit character sets.
X/Open would argue that your statement is 100% false, as unix multibyte
character sets (wchar_t, for example) have been around for three decades.
On 2025-04-27 at 21:55, Janis Papanagnou wrote:
I think we have to distinguish the technical base size, an octet,
from the actual filenames. My Linux has no problem to represent,
say, filenames in Chinese or German umlaut characters that require
for representation 2 octets.
You're joking. Which applications can currently handle more than
7-bit characters with Unix files?
Unix has a big problem in that it doesn't support 16-bit character sets.
Win32 supported UCS-2 from the beginning and UTF-16, afaik, since Windows
2000.
On 2025-04-28 at 03:22, Lawrence D'Oliveiro wrote:
Unfortunately, Windows has had to deal with the UCS-2→UTF-16 encoding
kludge ever since then. ...
That's not true. The codepoints for the surrogates were unused before.
On 2025-04-28 at 03:21, vallor wrote:
On Mon, 28 Apr 2025 02:53:45 +0200, Bonita Montero
<Bonita.Montero@gmail.com> wrote in
<vumjhf$20u1e$1@raubtier-asyl.eternal-september.org>:
On 2025-04-27 at 21:55, Janis Papanagnou wrote:
I think we have to distinguish the technical base size, an octet,
from the actual filenames. My Linux has no problem to represent, say,
filenames in Chinese or German umlaut characters that require for
representation 2 octets.
You're joking. Which applications can currently handle more than
7-bit characters with Unix files?
_[/home/vallor/tmp]_(vallor@lm)🐧_
$ touch 調和
_[/home/vallor/tmp]_(vallor@lm)🐧_
$ ls
調和
_[/home/vallor/tmp]_(vallor@lm)🐧_
$ ls -l
total 0
-rw-rw-r-- 1 vallor vallor 0 Apr 27 17:59 調和
ObC (What did I mess up here?):
$ cat readit.c
#include <stdio.h>
#include <sys/types.h>
#include <dirent.h>
#include <string.h>

int main(void)
{
    DIR *this = NULL;
    struct dirent *entry = NULL;
    unsigned char *s;   /* unsigned, so bytes >= 0x80 print as e8, not ffffffe8 */

    this = opendir(".");
    if (this == NULL)
        return 1;
    while ((entry = readdir(this)) != NULL)
    {
        if (!strcmp(entry->d_name, "."))  continue;
        if (!strcmp(entry->d_name, "..")) continue;
        for (s = (unsigned char *)entry->d_name; *s; s++)
            printf("%x\n", *s);
        puts("---");
    }
    closedir(this);
    return 0;
}
On 2025-04-28 at 06:55, vallor wrote:
On Mon, 28 Apr 2025 06:28:44 +0200, Bonita Montero
<Bonita.Montero@gmail.com> wrote in
https://stackoverflow.com/questions/38948141/how-are-linux-shells-and-
filesystem-unicode-aware
I don't see your point. Could I ask you to elaborate?
There's no standardized charset for Unix filesystems beyond 7-bit ASCII.
If you store chars >= 128 in one application they may become different
chars in another.
On 2025-04-28 at 09:42, Janis Papanagnou wrote:
Why are you repeatedly saying that; it's not true, and examples have
been provided. If applications are locale-aware - which is standard
for a long time - you can consistently use what you like.
There's no standard locale for a filesystem.
Linux sucks with that.
On 2025-04-28 at 11:31, Muttley@DastardlyHQ.org wrote:
*nix doesn't care about locales for most things including filenames, its
all just a sequence of bytes. Locales only matter for display such as
terminal char sets and dates.
Yes, Unix APIs are really archaic. When a filename is written
with one user's locale and another user with a different locale reads
it, they get at most a partially readable filename. For Janis
this seems to be flexibility, but for me it's a problem. A
filesystem should have a fixed charset, at best Unicode.
On 2025-04-28 at 13:01, Muttley@DastardlyHQ.org wrote:
I'd say logical. Why should the OS give a damn what locale the user is using,
and hence the filename, any more than it should care about what's inside the
file?
To have filenames displayed the same way no matter what locale is
currently configured.
How often would there be users using different locales on the same machine?
With Unix there's no locale defined for filesystem operations; it's
arbitrary.
On 2025-04-28 at 10:08, Janis Papanagnou wrote:
My file system (and obviously also the file systems of others that
are posting here) have no problems with any locale.
That's the problem: the filesystem should have a specific locale.
Otherwise you copy some files from a different computer where the
user has a different locale and you get Swahili-filenames.
On 2025-04-28 at 11:39, Bonita Montero wrote:
On 2025-04-28 at 11:31, Muttley@DastardlyHQ.org wrote:
Yes, Unix APIs are really archaic. When a filename is written
with one user's locale and another user with a different locale reads
it, they get at most a partially readable filename. For Janis
this seems to be flexibility, but for me it's a problem. A
filesystem should have a fixed charset, at best Unicode.
I did have a look at how macOS / APFS handles this:
for macOS all filenames are UTF-8.
On 2025-04-28 at 13:30, Bonita Montero wrote:
How often would there be users using different locales on the same
machine?
With Unix there's no locale defined for filesystem operations; it's
arbitrary.
And imagine that you have a tar-archive packed by someone with a
different locale; that's rather likely.
On 2025-04-28 at 16:21, Scott Lurndal wrote:
Bonita Montero <Bonita.Montero@gmail.com> writes:
On 2025-04-28 at 10:08, Janis Papanagnou wrote:
My file system (and obviously also the file systems of others that
are posting here) have no problems with any locale.
That's the problem: the filesystem should have a specific locale.
Otherwise you copy some files from a different computer where the
user has a different locale and you get Swahili-filenames.
nonsense.
No nonsense. If you create some files with extended chars and pack
them into a tar-file and unpack them on a different machine with
a different locale, you see the wrong characters.
On 2025-04-28 at 16:24, Scott Lurndal wrote:
Bonita Montero <Bonita.Montero@gmail.com> writes:
On 2025-04-28 at 11:39, Bonita Montero wrote:
On 2025-04-28 at 11:31, Muttley@DastardlyHQ.org wrote:
Yes, Unix APIs are really archaic. When a filename is written
with one user's locale and another user with a different locale reads
it, they get at most a partially readable filename. For Janis
this seems to be flexibility, but for me it's a problem. A
filesystem should have a fixed charset, at best Unicode.
I did have a look at how macOS / APFS handles this:
for macOS all filenames are UTF-8.
No, unix (and macOS _is_ unix) filenames are a simple stream of
bytes with no meaning or semantic associated with the bytes other than
the terminating nul character and the directory separator character.
Wikipedia says that APFS is UTF-8 capable.
<https://en.wikipedia.org/wiki/Apple_File_System>
On 2025-04-28 at 16:30, Scott Lurndal wrote:
Bonita Montero <Bonita.Montero@gmail.com> writes:
On 2025-04-28 at 13:30, Bonita Montero wrote:
How often would there be users using different locales on the same
machine?
With Unix there's no locale defined for filesystem operations; it's
arbitrary.
And imagine that you have a tar-archive packed by someone with a
different locale; that's rather likely.
The data in the tar archive is locale-independent. Always.
The filenames are not, if they contain characters >= 128.
Bonita Montero <Bonita.Montero@gmail.com> writes:
On 2025-04-28 at 16:24, Scott Lurndal wrote:
Bonita Montero <Bonita.Montero@gmail.com> writes:
On 2025-04-28 at 11:39, Bonita Montero wrote:
On 2025-04-28 at 11:31, Muttley@DastardlyHQ.org wrote:
Yes, Unix APIs are really archaic. When a filename is
written with one user's locale and another user with a different
locale reads it, they get at most a partially readable
filename. For Janis this seems to be flexibility, but for me
it's a problem. A filesystem should have a fixed charset, at
best Unicode.
I did have a look at how macOS / APFS handles this:
for macOS all filenames are UTF-8.
No, unix (and macOS _is_ unix) filenames are a simple stream of
bytes with no meaning or semantic associated with the bytes other
than the terminating nul character and the directory separator
character.
Wikipedia says that APFS is UTF-8 capable.
<https://en.wikipedia.org/wiki/Apple_File_System>
So is linux. The operating system ascribes no meaning to the bytes
stored in the filesystem directories. They're just a stream of
bytes.
One can treat them as UTF-8, which is generally the case. In which
case your objections about 'garbage' in a different locale are
pointless. UTF-8 fonts are universal. The current locale doesn't
matter.
Windows, on the other hand, limits the character set to those that can
be described in 16-bit units, and the "locale" matters for not only
display purposes, but also for character processing.
On Mon, 28 Apr 2025 17:03:46 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
So is linux. The operating system ascribes no meaning to the bytes
stored in the filesystem directories. They're just a stream of
bytes.
That's nonsense.
Every case-preserving, case-insensitive file system has to understand
character encodings, at least to a certain degree.
On 2025-04-28 at 10:08, Janis Papanagnou wrote:
My file system (and obviously also the file systems of others that
are posting here) have no problems with any locale.
That's the problem: the filesystem should have a specific locale.
Otherwise you copy some files from a different computer where the
user has a different locale and you get Swahili-filenames.
The historic architecture of Linux file systems is able to represent
files having file names in arbitrary languages. That's why the Unix
file systems don't show the issues that other (popular) OSes show.
Windows only has UTF-16 filenames and no varying locale.
Generally, and specifically if you choose to use international
characters for file names, the prevalent and nowadays de facto
standard is to use a UTF-8 encoding.
On 28.04.2025 11:10, Bonita Montero wrote:
On 2025-04-28 at 10:08, Janis Papanagnou wrote:
My file system (and obviously also the file systems of others that
are posting here) have no problems with any locale.
That's the problem: the filesystem should have a specific locale.
Otherwise you copy some files from a different computer where the
user has a different locale and you get Swahili-filenames.
Okay, I think I see where you're coming from.
That's nonsense.
Every case-preserving, case-insensitive file system has to understand
character encodings, at least to a certain degree.
Apple file systems can be configured to be case-sensitive, but that is
not the default, nor recommended for non-specialist users.
On 2025-04-28 at 18:59, Scott Lurndal wrote:
Not really. UTF-8 is UTF-8, regardless of the locale.
But UTF-8 isn't the standard locale for Unix filesystems
except with macOS.
On 2025-04-28, Michael S <already5chosen@yahoo.com> wrote:
That's nonsense.
Every case-preserving, case-insensitive file system has to understand
character encodings, at least to a certain degree.
Commonly used filesystems on Linux are case sensitive.
On Mon, 28 Apr 2025 18:28:46 -0000 (UTC)
Kaz Kylheku <643-408-1753@kylheku.com> wrote:
On 2025-04-28, Michael S <already5chosen@yahoo.com> wrote:
That's nonsense.
Every case-preserving, case-insensitive file system has to understand
character encodings, at least to a certain degree.
Commonly used filesystems on Linux are case sensitive.
Scott's claim was about *all* Unixes and in his previous message he emphasized that he classifies Apple OS/X as Unix.
One can treat them as UTF-8, which is generally the case. In which
case your objections about 'garbage' in a different locale are
pointless. UTF-8 fonts are universal. The current locale doesn't
matter.
Windows, on the other hand, limits the character set to those that
can be described in 16-bit units, and the "locale" matters for not
only display purposes, but also for character processing.
It's rather hard to understand what you mean by the sentence above.
If you meant to say that Windows file names have to use only
characters that were present in the [mostly forgotten] UCS-2 character
set, then you are mistaken.
If you meant something else then please express yourself more clearly.
If it was your usual instinctive Windows bashing then don't bother.
On 2025-04-28 at 20:05, Janis Papanagnou wrote:
(I thought Windows would use "UCS2". Anyway; would 16 bit suffice to
support full Unicode; I thought it wouldn't, or only old restricted
versions of Unicode.)
Windows has used UTF-16 since Windows 2000, and UCS-2 before.
On 2025-04-28 at 20:47, Richard Harnden wrote:
On 28/04/2025 19:36, Bonita Montero wrote:
On 2025-04-28 at 18:59, Scott Lurndal wrote:
Not really. UTF-8 is UTF-8, regardless of the locale.
But UTF-8 isn't the standard locale for Unix filesystems
except with macOS.
UTF-8 isn't a locale - it's an encoding.
Idiot.
Type "locale" in the shell and thenn return.
On 2025-04-28 at 06:47, Lawrence D'Oliveiro wrote:
On Mon, 28 Apr 2025 06:28:05 +0200, Bonita Montero wrote:
On 2025-04-28 at 03:22, Lawrence D'Oliveiro wrote:
That's not true. The codepoints for the surrogates were unused before.
The problem is the fact that you have to deal with surrogates.
That's trivial.
On 2025-04-29 at 01:24, Richard Heathfield wrote:
On 28/04/2025 22:26, Bonita Montero wrote:
On 2025-04-28 at 20:47, Richard Harnden wrote:
On 28/04/2025 19:36, Bonita Montero wrote:
On 2025-04-28 at 18:59, Scott Lurndal wrote:
Not really. UTF-8 is UTF-8, regardless of the locale.
But UTF-8 isn't the standard locale for Unix filesystems
except with macOS.
UTF-8 isn't a locale - it's an encoding.
Idiot.
Type "locale" in the shell and thenn return.
As David knows and you apparently don't, UTF-8 is an encoding,
not a locale.
If you must call people idiots, it's probably wisest to make
sure first that you're on solid ground.
UTF-8 has a locale; the chars between 128 and 255 have the locale
Latin-1.
There are no locales with UTF-16.
On 2025-04-29 at 04:37, Lawrence D'Oliveiro wrote:
On Mon, 28 Apr 2025 09:28:14 +0200, Bonita Montero wrote:
On 2025-04-28 at 06:47, Lawrence D'Oliveiro wrote:
On Mon, 28 Apr 2025 06:28:05 +0200, Bonita Montero wrote:
On 2025-04-28 at 03:22, Lawrence D'Oliveiro wrote:
That's not true. The codepoints for the surrogates were unused
before.
The problem is the fact that you have to deal with surrogates.
That's trivial.
I had to deal with it in Java code. It’s not trivial.
Far easier to have systems, like Python or Linux, which can deal with
full Unicode in a more native fashion.
I've got my u16_feeder iterator for that, which I've been using for years.
It has the same semantics as a pointer in C++ and it's as easy to use.
On Mon, 28 Apr 2025 22:29:14 +0300, Michael S wrote:
... he classifies Apple OS/X as Unix.
That is the only real “Unix” left. Linux is officially not “Unix”.
On Mon, 28 Apr 2025 13:30:17 +0200, Bonita Montero wrote:
With Unix there's no locale defined for filesystem operations; it's
arbitrary.
Don’t confuse “Unix” with “Linux”. On Linux, ASCII “/” is the pathname
component separator, and ASCII NUL is the pathname terminator. Everything
else is simply passed through as is. In particular, I can use “∕” as part
of a pathname component, if I want.
On 29.04.2025 09:26, Lawrence D'Oliveiro wrote:
On Mon, 28 Apr 2025 22:29:14 +0300, Michael S wrote:
... he classifies Apple OS/X as Unix.
That is the only real “Unix” left. Linux is officially not “Unix”.
I'm not sure what sort of "officially" you have in mind.
As opposed to UNIX, a trademark and originally identifying the AT&T
version of a Unix system, the term Unix is usually used to classify
the _family_ of these operating systems. But MacOS X is in the line
of BSD Unixes (not AT&T). So both Linux and MacOS X are Unixes.
Janis
Bonita Montero <Bonita.Montero@gmail.com> writes:
On 2025-04-28 at 20:47, Richard Harnden wrote:
On 28/04/2025 19:36, Bonita Montero wrote:
On 2025-04-28 at 18:59, Scott Lurndal wrote:
Not really. UTF-8 is UTF-8, regardless of the locale.
But UTF-8 isn't the standard locale for Unix filesystems
except with macOS.
UTF-8 isn't a locale - it's an encoding.
Idiot.
Type "locale" in the shell and thenn return.
$ locale
LANG=C
LC_CTYPE="C"
LC_NUMERIC=C
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"
LC_ALL=
On 28/04/2025 22:26, Bonita Montero wrote:
On 2025-04-28 at 20:47, Richard Harnden wrote:
On 28/04/2025 19:36, Bonita Montero wrote:
On 2025-04-28 at 18:59, Scott Lurndal wrote:
Not really. UTF-8 is UTF-8, regardless of the locale.
But UTF-8 isn't the standard locale for Unix filesystems
except with macOS.
UTF-8 isn't a locale - it's an encoding.
Idiot.
Type "locale" in the shell and thenn return.
As David knows and you apparently don't, UTF-8 is an encoding, not a
locale.
If you must call people idiots, it's probably wisest to make sure first
that you're on solid ground.
On 28.04.2025 20:38, Bonita Montero wrote:
On 2025-04-28 at 20:05, Janis Papanagnou wrote:
(I thought Windows would use "UCS2". Anyway; would 16 bit suffice to
support full Unicode; I thought it wouldn't, or only old restricted
versions of Unicode.)
Windows has used UTF-16 since Windows 2000, and UCS-2 before.
Oh, so it's a multi-word encoding (like UTF-8 is a multi-byte encoding)
and a character not necessarily encoded with only one 16 bit word... -
...but then I wonder even more where you see an advantage.
On 2025-04-29 at 09:25, Lawrence D'Oliveiro wrote:
If only ...
Note the limitations on Windows with
<https://docs.python.org/3/library/select.html>, just for example.
Windows has I/O-completion ports which are more flexible than select().
On 29/04/2025 01:13, Janis Papanagnou wrote:
Oh, so it's a multi-word encoding (like UTF-8 is a multi-byte encoding)
and a character not necessarily encoded with only one 16 bit word... -
...but then I wonder even more where you see an advantage.
When Unicode started, they thought 16 bits would be enough. UCS2 made
Which "David" are you referring to here?
On 2025-04-29 at 04:39, Lawrence D'Oliveiro wrote:
On Mon, 28 Apr 2025 09:44:27 +0200, Bonita Montero wrote:
There's no standard locale for a filesystem. Linux sucks with that.
Is your mission in life to try to make Windows look better than Linux?
You realize that’s futile, don’t you?
Not generally, but the Win32 APIs are more mature than the Posix APIs,
though the implementation is slower.
On 2025-04-29 at 02:28, Scott Lurndal wrote:
$ locale
LANG=C
[...]
For me:
boni@Raubtier-Asyl:/mnt/c/Users/Boni$ locale
LANG=C.UTF-8
LANGUAGE=
On Tue, 29 Apr 2025 10:58:46 +0200
David Brown <david.brown@hesbynett.no> wibbled:
On 29/04/2025 01:13, Janis Papanagnou wrote:
Oh, so it's a multi-word encoding (like UTF-8 is a multi-byte encoding)
and a character not necessarily encoded with only one 16 bit word... -
...but then I wonder even more where you see an advantage.
When Unicode started, they thought 16 bits would be enough. UCS2 made
They only had to check how many Chinese pictograms there are to realise
that it was never going to be enough. Perhaps Chinese wasn't considered
important back then.
On 2025-04-29 at 09:28, Lawrence D'Oliveiro wrote:
On Mon, 28 Apr 2025 18:56:56 +0200, Bonita Montero wrote:
The data in the tar archive is locale-independent. Always.
The filenames are not, if they contain characters >= 128.
Doesn’t matter. They will still pack/unpack correctly.
Depends on the locale of the person who sees the filenames.
Case insensitive file systems are an abortion that no sane OS should use.
z/OS is alive and in good shape, but everybody knows that despite
the trademark it is not similar to Unix.
On Tue, 29 Apr 2025 07:40:33 -0000 (UTC), Muttley wrote:
Case insensitive file systems are an abortion that no sane OS should use.
Linux at least offers the option.
On Wed, 30 Apr 2025 01:54:06 -0000 (UTC)
Lawrence D'Oliveiro <ldo@nz.invalid> wibbled:
On Tue, 29 Apr 2025 07:40:33 -0000 (UTC), Muttley wrote:
Case insensitive file systems are an abortion that no sane OS should use.
Linux at least offers the option.
AFAIK there's no case-insensitive filesystem for Linux.
More relevant to this group, it may also be convenient for people
trying to work with big C code bases that were written on Windows and
that you now want to compile (for whatever target you want) on Linux.
I've seen code bases developed on Windows machines where the
capitalisation of include directives was inconsistent - that works on
case-insensitive filesystems, but not on case-sensitive systems. (Yes,
I know there are many other ways to deal with such issues, but putting
the source code in a case-insensitive directory on ext4 is one option.)
On Tue, 29 Apr 2025 11:17:50 +0300, Michael S wrote:
z/Os is alive and in good shape, but everybody knows that despite
the trademark it is not similar to Unix.
Just goes to show the worthlessness of the “Unix” name nowadays.
On Wed, 30 Apr 2025 09:45:20 +0200
David Brown <david.brown@hesbynett.no> wibbled:
More relevant to this group, it may also be convenient for people
trying to work with big C code bases that were written on Windows and
you now want to compile (for whatever target you want) them on Linux.
I've seen code bases developed on Windows machines where the
capitalisation of include directives was inconsistent - that works on
case-insensitive filesystems, but not on case-sensitive systems. (Yes,
I know there are many other ways to deal with such issues, but putting
the source code in a case-insensitive directory on ext4 is one option.)
I've seen on more than one occasion C++ (not C yet) projects where there
were 2 files only different in case, eg: Network.cpp and network.cpp where the former would be the class and the latter would be procedural support code.
Good luck unzipping that on Windows or any other case insensitive file system.
On 30/04/2025 11:52, Janis Papanagnou wrote:
On 30.04.2025 11:06, Muttley@DastardlyHQ.org wrote:
On Wed, 30 Apr 2025 09:45:20 +0200
David Brown <david.brown@hesbynett.no> wibbled:
More relevant to this group, it may also be convenient for people
trying to work with big C code bases that were written on Windows and
you now want to compile (for whatever target you want) them on Linux.
I've seen code bases developed on Windows machines where the
capitalisation of include directives was inconsistent - that works on
case-insensitive filesystems, but not on case-sensitive systems. (Yes,
I know there are many other ways to deal with such issues, but putting
the source code in a case-insensitive directory on ext4 is one option.)
I've seen on more than one occasion C++ (not C yet) projects where there
were 2 files only different in case, eg: Network.cpp and network.cpp where
the former would be the class and the latter would be procedural
support code.
Good luck unzipping that on Windows or any other case insensitive
file system.
For low-level system software like network functionality that
would probably anyway not work on Windows in the first place
without change, independent of the capitalization. (But the
"case insensitive file system" issues, like the above mentioned
case inconsistencies, are of course an inherent problem.)
And there's of course a related problem if we port software with
longer maximum filename lengths to systems with shorter filename
lengths.
What systems are there now with filename length limits that would ever
be relevant to hand-typed names?
Filename length limits can
occasionally be relevant in some contexts (I've seen it in web spiders
that try to turn complete URLs into single filenames),
but unless you are trying to compile code on DOS,
any system will support any length of
filename that someone would bother typing into an "#include" line.
[...]
Linus Torvalds has just had one of his famous rants in reference to case-insensitive options for Bcachefs:
[link snipped]
[...]
On 30.04.2025 11:06, Muttley@DastardlyHQ.org wrote:
On Wed, 30 Apr 2025 09:45:20 +0200
David Brown <david.brown@hesbynett.no> wibbled:
More relevant to this group, it may also be convenient for people
trying to work with big C code bases that were written on Windows and
you now want to compile (for whatever target you want) them on Linux.
I've seen code bases developed on Windows machines where the
capitalisation of include directives was inconsistent - that works on
case-insensitive filesystems, but not on case-sensitive systems. (Yes,
I know there are many other ways to deal with such issues, but putting
the source code in a case-insensitive directory on ext4 is one option.)
I've seen on more than one occasion C++ (not C yet) projects where there
were 2 files only different in case, eg: Network.cpp and network.cpp where
the former would be the class and the latter would be procedural support code.
Good luck unzipping that on Windows or any other case insensitive file system.
For low-level system software like network functionality that
would probably anyway not work on Windows in the first place
without change, independent of the capitalization. (But the
"case insensitive file system" issues, like the above mentioned
case inconsistencies, are of course an inherent problem.)
And there's of course a related problem if we port software with
longer maximum filename lengths to systems with shorter filename
lengths.
On 30/04/2025 11:06, Muttley@DastardlyHQ.org wrote:
I've seen on more than one occasion C++ (not C yet) projects where there
were 2 files only different in case, eg: Network.cpp and network.cpp
where
the former would be the class and the latter would be procedural
support code.
I'd question the wisdom of such a convention. I'd rather have clearer separation of the filenames, or perhaps use different directories,
aiming to make it hard to mix up the names. But maybe it is an
appropriate choice in some situations - perhaps alternative naming
schemes were considered worse in other ways.
On 30.04.2025 03:53, Lawrence D'Oliveiro wrote:
On Tue, 29 Apr 2025 11:17:50 +0300, Michael S wrote:
z/Os is alive and in good shape, but everybody knows that despite
the trademark it is not similar to Unix.
Just goes to show the worthlessness of the “Unix” name nowadays.
"UNIX" has a meaning that varied historically. But "Unix" is
commonly used as a name for the family of "UNIX-like" systems;
that's very useful since it allows to formulate commonalities
of this OS family.[*]
Janis
[*] As we've seen in the discussion of Unix file systems with
its basic structure of being built by sequences of octets[**]
and having two distinguished characters '\0' and '/'.
[**] BTW; does anyone know how e.g. the [historic] Borroughs
Unix systems with their 9 bit/36 bit architecture had their
file systems defined (w.r.t. the octet transfer syntax)?
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
On 30.04.2025 03:53, Lawrence D'Oliveiro wrote:
On Tue, 29 Apr 2025 11:17:50 +0300, Michael S wrote:
z/Os is alive and in good shape, but everybody knows that despite
the trademark it is not similar to Unix.
Just goes to show the worthlessness of the “Unix” name nowadays.
"UNIX" has a meaning that varied historically. But "Unix" is
commonly used as a name for the family of "UNIX-like" systems;
that's very useful since it allows to formulate commonalities
of this OS family.[*]
[*] As we've seen in the discussion of Unix file systems with
its basic structure of being built by sequences of octets[**]
and having two distinguished characters '\0' and '/'.
[**] BTW; does anyone know how e.g. the [historic] Borroughs
s/Borroughs/Burroughs/
then
s/Burroughs/Sperry/
[...]
Its certainly not a scheme I'd use, but I've also seen Makefile and
makefile in the same package build directory in the past.
On 30.04.2025 15:41, Scott Lurndal wrote:
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
On 30.04.2025 03:53, Lawrence D'Oliveiro wrote:
On Tue, 29 Apr 2025 11:17:50 +0300, Michael S wrote:
z/Os is alive and in good shape, but everybody knows that despite
the trademark it is not similar to Unix.
Just goes to show the worthlessness of the “Unix” name nowadays.
"UNIX" has a meaning that varied historically. But "Unix" is
commonly used as a name for the family of "UNIX-like" systems;
that's very useful since it allows to formulate commonalities
of this OS family.[*]
[*] As we've seen in the discussion of Unix file systems with
its basic structure of being built by sequences of octets[**]
and having two distinguished characters '\0' and '/'.
[**] BTW; does anyone know how e.g. the [historic] Borroughs
s/Borroughs/Burroughs/
then
s/Burroughs/Sperry/
Oh, sorry, I actually made even a more serious mistake beyond a typo;
s/Borroughs/Honeywell 6000/
But the question was not so much about the concrete system label as
the principal question of what happens if a system's character width
is defined as 9 bits, the underlying hardware (like hard disks) is
probably 8 bits, and a Unix OS file-system sits in between.
Janis
[...]
On Wed, 30 Apr 2025 12:38:53 -0000 (UTC), Muttley wrote:
Its certainly not a scheme I'd use, but I've also seen Makefile and
makefile in the same package build directory in the past.
The GNU “make” command, specified without a filename, looks for “GNUmakefile”, then “Makefile”, then “makefile”. The man page <https://manpages.debian.org/make(1)> says:
We recommend Makefile because it appears prominently near the
beginning of a directory listing, right near other important files
such as README.
But is this still true for most people? I think the default sort
settings these days no longer put all-caps names at the top.
On Thu, 01 May 2025 00:15:12 +0200, Janis Papanagnou wrote:
On 30.04.2025 15:41, Scott Lurndal wrote:
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
On 30.04.2025 03:53, Lawrence D'Oliveiro wrote:
On Tue, 29 Apr 2025 11:17:50 +0300, Michael S wrote:
z/Os is alive and in good shape, but everybody knows that despite
the trademark it is not similar to Unix.
Just goes to show the worthlessness of the “Unix” name nowadays.
"UNIX" has a meaning that varied historically. But "Unix" is
commonly used as a name for the family of "UNIX-like" systems;
that's very useful since it allows to formulate commonalities
of this OS family.[*]
[*] As we've seen in the discussion of Unix file systems with
its basic structure of being built by sequences of octets[**]
and having two distinguished characters '\0' and '/'.
[**] BTW; does anyone know how e.g. the [historic] Borroughs
s/Borroughs/Burroughs/
then
s/Burroughs/Sperry/
Oh, sorry, I actually made even a more serious mistake beyond a typo;
s/Borroughs/Honeywell 6000/
But the question was not so much about the concrete system label but
the principle question what happens if a system's character width is
defined as 9 bit, the underlying hardware (like hard disks) probably
8 bit,
A quick read through the Wikipedia article on the Honeywell 6000 and
another read through the documentation on the (related) DDS190 disk
storage unit (see https://www.manualslib.com/manual/1939073/Honeywell-6000-Series.html?page=8#manual)
indicates that the hard disks used 6-bit characters.
That would mean
that, on disk, you could store a Honeywell 6000 36-bit word as six 6-bit
characters (or two 9-bit program characters in three 6-bit storage characters).
and a Unix OS file-system in between.
Bonita Montero <Bonita.Montero@gmail.com> writes:
On 29.04.2025 at 02:28, Scott Lurndal wrote:
Bonita Montero <Bonita.Montero@gmail.com> writes:
On 28.04.2025 at 20:47, Richard Harnden wrote:
On 28/04/2025 19:36, Bonita Montero wrote:
On 28.04.2025 at 18:59, Scott Lurndal wrote:
Not really. UTF-8 is UTF-8, regardless of the locale.
But UTF-8 isn't the standard locale for Unix filesystems
except with macOS.
UTF-8 isn't a locale - it's an encoding.
Idiot.
Type "locale" in the shell and then press return.
$ locale
LANG=C
LC_CTYPE="C"
LC_NUMERIC=C
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"
LC_ALL=
For me:
boni@Raubtier-Asyl:/mnt/c/Users/Boni$ locale
LANG=C.UTF-8
LANGUAGE=
Same locale, different encoding. As has been pointed out
to you repeatedly.
[...]
The partial rationale here being that the directory entries in this case
were fixed size (like FAT, albeit with longer names), and this could potentially make the difference between using a single directory entry
or needing a more complex LFN style scheme. Though, in this case, the
default name length is 48, and it is rare for a filename to not fit into
48 bytes.
[...]
[...]
Though, if someone really must make something case-insensitive, a case
could be made for only supporting it for maybe Latin, Greek, and
Cyrillic.
Ideally, this would be better handled in a file-browser or
similar, and not in the VFS or FS driver itself.
On 5/6/2025 6:19 PM, Janis Papanagnou wrote:
On 06.05.2025 20:01, BGB wrote:
[...]
The partial rationale here being that the directory entries in this case
were fixed size (like FAT, albeit with longer names), and this could
potentially make the difference between using a single directory entry
or needing a more complex LFN style scheme. Though, in this case, the
default name length is 48, and it is rare for a filename to not fit into
48 bytes.
You mean rare in your application areas?
This appears to me like a very conservative size. While I'd agree
that it's probably a sensible value for own files with explicitly
chosen file names a lot of files that are downloaded regularly do
have longer file names. A quick check of my "Documents" directory
(that contains both, downloaded files and own files) shows a ratio
of 1563:629, i.e. roughly about 30% files of "document" type with
lengths > 48 (there's no files with a file name length > 128).
I recall someone here recently spoke about chosen lengths of 255
(or some such) for file names, which seems to be plenty, OTOH.
Running quick/dirty stats of everything on my "K:" drive, roughly 2
million files of various assorted types.
Stats (file names less than N bytes):
16: 66.40%
24: 87.85%
32: 95.38%
48: 99.31%
On 07.05.2025 12:08, BGB wrote:
[...]
Though, if someone really must make something case-insensitive, a
case could be made for only supporting it for maybe Latin, Greek,
and Cyrillic.
I don't understand what you want to say here; it just sounds strange
to me. - Mind to elaborate?
Ideally, this would be better handled in a file-browser or
similar, and not in the VFS or FS driver itself.
Janis
On 5/7/2025 8:45 AM, Scott Lurndal wrote:
BGB <cr88192@gmail.com> writes:
On 5/6/2025 6:19 PM, Janis Papanagnou wrote:
On 06.05.2025 20:01, BGB wrote:
[...]
These component names are what the filesystem actually stores.
Combined path length can be somewhat longer.
A traditional limit is 260 chars though.
There wasn't a well defined path-length limit in my projects, though
informally typically somewhere between 256 and 768 bytes.
It is rare to see a path over 256, but, "if doing it properly", a
consistent length limit of, say, 512 or 768 would make sense. Going any
bigger is likely needlessly overkill.
Full paths will frequently exceed 48 bytes; this, for example,
is 142 bytes.
/work/music/Blues/Howlin' Wolf/The Chess Box (1963 - 1973) (disc 3)/20 - The Red Rooster (London Sessions w false start and dialog) (1970).wav
We don't usually store full paths in a filesystem, as each directory
exists as its own entity.
So, say, if broken down:
5 chars ("work")
6 chars ("music")
6 chars
13 chars
37 chars
75 chars.
Longest name counted here would be 75.
It is not all that uncommon for directories to get larger than a limit
where linear search is inefficient.
On 5/7/2025 7:58 AM, Janis Papanagnou wrote:
On 07.05.2025 12:08, BGB wrote:
[...]
Though, if someone really must make something case-insensitive, a case
could be made for only supporting it for maybe Latin, Greek, and
Cyrillic.
I don't understand what you want to say here; it just sounds strange
to me. - Mind to elaborate?
Latin, Greek, and Cyrillic are the main alphabets which actually have a
useful and reasonably well defined concept of "case", and thus "case
folding" actually makes sense for these.
For most other places, it does not, and one can likely ignore rules for
things outside of these alphabets. Can eliminate a bunch of rules for
alphabets that don't actually have "case" as we would understand it.
By limiting rules in these ways, a simpler and more manageable set of
rules is possible. Vs, say, actual Unicode rules, which tend to have
stuff going on all over the place.
Ligatures pose an issue though, but presumably option is one of:
Case fold between ligatures, when both variants exist;
Treat the ligature as its own character;
Decompose and compare.
Though, FWIW, in my normalization code, I mostly ignored ligatures, as
while they could be decomposed in many cases, they could only be
recomposed for locales that actually use said ligature (like, in
English, if AE and IJ started spontaneously merging into new characters,
this would be weird and out of place; and having a filesystem layer that merely decomposed any ligatures it encountered would not be ideal).
Ideally, this would be better handled in a file-browser or
similar, and not in the VFS or FS driver itself.
Janis
Ligatures pose an issue though ...
Ideally, filesystems should be case sensitive by default;
If someone wants case insensitivity, this can be better handled at the application or file-browser level.
[...]
Though, even for the Latin alphabet, once one goes much outside of ASCII
and Latin-1, it gets messy.
[...]
[...]
No matter how you choose to do it, you will get it wrong sometimes. Case-insensitive comparison has language-specific details in addition to
the character in the Unicode tables. Should the lower-case version of
"SS" be "ss" or "ß" ? That depends on the language and the position of
the letters. Should the capital of "ß" be "SS" or "ẞ"? [...]
[...]
But, I don't bother with C1 control codes, as they are unused ...
In some contexts, may or may not also have ANSI escape sequences, though generally no text editors deal with or make use of ANSI escapes.
AFAIK, the only other script that shares all of these properties is
Armenian.
Either way, case-insensitivity at the FS level adds complexity.
I guess, one intermediate option could be to keep the FS proper as case sensitive, but then fake case insensitivity at the level of the OS APIs (based on a system-level locale setting).
BGB <cr88192@gmail.com> writes:
[...][...]
The Latin-1 8-bit character set is largely obsolete. Whatever point
you're making, I suspect you could make it much more clearly without
any reference to Latin-1 or Windows-1252.
[...][...]
It is 8-bit and byte-based, and informally I think, most
extended-ASCII codepages were collectively known as ASCII even if only
the low 7-bit range is ASCII proper (and I think more for sake of
contrast with "Not Unicode", eg, UTF-8 / UTF-16 / UCS-2 / ...).
No, 8-bit character sets are not ASCII. Calling them "extended ASCII"
is reasonable.
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
In a "C" file (of the Kornshell software) I stumbled across this
comment: "Each command in the history file starts on an even byte
and is null-terminated."
I wonder what's the reason behind that even-byte-alignment, on "C"
level or on Unix/files level. Any ideas?
This question led to a number of long digressions, most of which didn't address the original question.
The quoted comment is in src/cmd/ksh93/edit/history.c in <https://github.com/ksh93/ksh>. It goes on to mention versions 0
and 1 of the history file format.
I haven't been able to find sources for ksh that would shed any light on this.
The even byte requirement in version 1 was likely inherited from version
0. The initial commit in the git repo includes release notes going back
to 1987, but no old versions of the source code.
My best guess is that the author of some early version of ksh, when
first defining the Version 0 history file format, just thought that even
byte alignment was a good idea at the time. There might not be any
deeper reason than that.
On Thu, 8 May 2025 01:57:05 -0500, BGB wrote:
Either way, case-insensitivity at the FS level adds complexity.
If you look around some other groups, you will see discussion of a recent rant from Linus Torvalds on this very issue. Basically, he doesn’t like case-insensitivity. And he is justified in pointing out that it leads to
more opportunities for bugs in the kernel code.
The only reason we need to
have it is because it makes certain things easier for users.
Unicode doesn't have locales ...
On 5/8/2025 9:26 PM, Lawrence D'Oliveiro wrote:
Possibly, though generally, ESC+[ is used IME.
On Thu, 8 May 2025 18:50:33 -0500, BGB wrote:
But, I don't bother with C1 control codes, as they are unused ...
Mostly true. But I think terminal emulators do interpret CSI as
equivalent to ESC followed by “[”.
Also creates uncertainty, as AFAIK the terminals traditionally operate
on raw bytes regarding ANSI commands, whereas if the terminal interface
is UTF-8, a CSI (as a 2-byte encoding) would not be equivalent to 0x9B
(if encoded as a single byte).
I was thinking here more of a GUI based editor or pseudo-word processor; where Text + ANSI codes could, in theory, serve a similar role to the
RTF format, although more as extended text rather than a sort of markup language (though, modern word processors typically use XML internally,
as opposed to the more unusual markup scheme that RTF had used).
Sometimes, it would also be nice if there was a sort of a standalone graphical viewer/editor that used MediaWiki or Markdown or AsciiDoc or similar.