• Globbing versus regular expressions

    From Axel Reichert@21:1/5 to All on Sun Jul 21 09:01:18 2024
    Hi all,

    a colleague (new to command line wizardry) seemed puzzled by the
    existence of both globbing for file names (shell) and regular
    expressions for strings (many other command line tools).

    Since I have been familiar with both mechanisms for decades, I never
    thought about this "redundancy", but now I think he has a point, even
    more so if you are using the "dired" file manager in Emacs, which
    further blurs the distinction between mangling text and working on
    files.

    Since regexes are (at quick glance) a superset of globs, why not
    consistently use the former for both file names and strings? The
    few additional keystrokes (.* instead of *) are IMHO easily compensated
    for by the more powerful capabilities of regexes.

    A little reading on Wikipedia showed that both came into popular usage
    in the early 70s. So why was globbing not dropped and regexes used
    throughout? It seems that ksh93 supports "regex globbing". bash has
    "extended globbing", but this seems a clumsy, bolted-on solution. Are
    there shells out there which follow a regex-only approach (of course this
    would be non-POSIX)?

    Happy for any further insights (technical or historical) shed on this
    topic!

    Axel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Joerg Mertens@21:1/5 to Axel Reichert on Sun Jul 21 11:36:47 2024
    I guess, glob style matching is just easier to learn, especially
    for non-technical persons. In regular expressions you have more
    special characters you cannot use unescaped, like e.g. the dot,
    which is part of many filenames. This makes it more inconvenient
    to use in the context of file handling and it also takes more time
    to learn all the details. I know both systems and I also tend to
    use globbing when working with files, even if regex is available.
    Most of the time you don't need the more complex possibilities regex
    provides.
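
    A minimal Python sketch of the dot problem described above. The file
    names are made up for illustration, and Python's fnmatch module stands
    in for shell globbing here.

```python
import fnmatch
import re

# Hypothetical file names, chosen only to show the difference.
names = ["notes.txt", "notes_txt", "draft.txt.bak", "a.txt"]

# Glob: the dot is literal, * means "any run of characters".
glob_hits = [n for n in names if fnmatch.fnmatchcase(n, "*.txt")]

# Regex written with glob habits: the unescaped dot matches any character,
# so "notes_txt" slips through.
naive_hits = [n for n in names if re.fullmatch(r".*.txt", n)]

# Equivalent regex: the dot has to be escaped.
regex_hits = [n for n in names if re.fullmatch(r".*\.txt", n)]

print(glob_hits)
print(naive_hits)
print(regex_hits)
```

    The glob and the escaped regex select the same names; the naive regex
    also picks up "notes_txt".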

  • From Janis Papanagnou@21:1/5 to Axel Reichert on Sun Jul 21 11:36:08 2024
    As you noticed, the two forms came from different sorts of sources.
    Globbing was originally primitive (*, ?, [...], [!...]), and the
    regexps implemented parsing/matching of the Regular Expressions
    class of formal languages.

    The syntax of globbing made it impossible to switch to BRE or ERE
    (the two common forms on Unix systems that supported regexps;
    meanwhile there are yet more variants with syntactic sugar, but
    also extensions of the class of Regular Expressions, accompanied
    with its own caveats). The characters, say, '.' or '?' have just
    different meanings. That's why Kornshell introduced extensions of
    globbing to support some patterns known from BRE/ERE, for example
    *(X) for X*, +(X) for X+, ?(X) for X?, and a few others of its own,
    like @(X) and !(X). At some point, because of the prevalence of
    BRE/ERE, it supported both forms in cases where regular expressions
    have to be parsed.
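
    The correspondence listed above can be sketched in Python. The helper
    name ksh_group_to_regex is hypothetical and this is not how any shell
    implements it; it assumes non-nested groups, treats everything outside
    a group as literal text (plain * and ? included), and leaves out the
    negation !(X).

```python
import re

# ksh operator -> regex quantifier placed after the group.
# @(X) means "exactly one of", so it needs no quantifier.
QUANTIFIER = {"?": "?", "*": "*", "+": "+", "@": ""}

# One non-nested extended-pattern group such as +(ab|cd).
GROUP = re.compile(r"([?*+@])\(([^()]*)\)")

def ksh_group_to_regex(pattern: str) -> str:
    """Translate ?(X), *(X), +(X), @(X) groups; all other text is literal."""
    out = []
    pos = 0
    for m in GROUP.finditer(pattern):
        # Literal text between groups is escaped for the regex engine.
        out.append(re.escape(pattern[pos:m.start()]))
        alternatives = "|".join(re.escape(a) for a in m.group(2).split("|"))
        out.append("(?:" + alternatives + ")" + QUANTIFIER[m.group(1)])
        pos = m.end()
    out.append(re.escape(pattern[pos:]))
    return "".join(out)

print(ksh_group_to_regex("+(ab)"))
print(ksh_group_to_regex("*(ab|cd).log"))
```

    For example, the regex produced for "*(ab|cd).log" matches both ".log"
    and "abcd.log" when used with re.fullmatch.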

    The Kornshell patterns are (with one subtle corner case) a full
    regular expression vehicle, and they are certainly not a "subset" of
    regexps, unlike the more primitive (historic) globbing functions
    (that are still used in most other shells).

    You cannot just remove "globbing regexps" because they are used.
    The same goes for BRE/ERE. You would just break most code: code based
    on globbing and code based on BRE/ERE. So I think Kornshell did
    it the right way, to support both. And with Kornshell globbing we
    can of course also express much more powerful patterns than with
    the primitive historic globbing facilities.

    But even the Unix regexps have different (meta-)syntax depending
    on the tools; the reason is that some characters have a native
    meaning, thus the regexp metacharacter has to be marked (escaped)
    to get one meaning or the other; sometimes the escape '\' symbol
    makes the subsequent character a regexp meta-character, in other
    tools it makes it a native non-metacharacter.

    So it would make little sense (IMO) to have "shells out there
    which follow a regex-only approach"; they would certainly be
    non-standard and most likely of limited use (even if used only in
    their own restricted ecosystems).

    Janis

  • From Kaz Kylheku@21:1/5 to Axel Reichert on Sun Jul 21 15:09:46 2024
    On 2024-07-21, Axel Reichert <mail@axel-reichert.de> wrote:
    > Hi all,
    >
    > a colleague (new to command line wizardry) seemed puzzled by the
    > existence of both globbing for file names (shell) and regular
    > expressions for strings (many other command line tools).

    path name globs are regular expressions. You can mechanically translate
    globs to regular expressions.

    If you have a regex engine and access to traversing the filesystem,
    you can write glob.

    globs are a syntactic sugar for a subset of regex. They are not as
    powerful, but they are more concise and ergonomic for their target
    use cases.
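
    To illustrate the point about building glob from a regex engine plus
    directory traversal, here is a rough Python sketch. mini_glob is a
    hypothetical name; it handles a single directory level only and relies
    on the standard library's fnmatch.translate for the glob-to-regex step.

```python
import fnmatch
import os
import re
import tempfile

def mini_glob(directory: str, pattern: str) -> list[str]:
    """Return sorted names in `directory` that match the glob `pattern`."""
    # Mechanical translation: the glob becomes an ordinary regex.
    regex = re.compile(fnmatch.translate(pattern))
    # Like the shell, skip hidden names unless the pattern starts with a dot.
    return sorted(
        name for name in os.listdir(directory)
        if regex.match(name)
        and (pattern.startswith(".") or not name.startswith("."))
    )

# Try it on a scratch directory with made-up file names.
with tempfile.TemporaryDirectory() as d:
    for name in ["a.txt", "b.txt", "c.md", ".hidden.txt"]:
        open(os.path.join(d, name), "w").close()
    matches = mini_glob(d, "*.txt")

print(fnmatch.translate("*.txt"))  # exact regex text varies by Python version
print(matches)
```

    The translated regex is noticeably longer than the five-character glob,
    which is the conciseness argument in a nutshell.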

    > Since regexes are (at quick glance) a superset of globs, why not
    > consistently use the former for both file names and strings? The

    globs are used for strings; check out the case statement in the
    POSIX shell language, and the fnmatch C library function.
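
    As a rough analogue of a shell case statement, the following Python
    sketch matches plain strings (not file names) against glob patterns.
    Python's fnmatch module stands in for the C fnmatch function here and
    is not identical to it; the classify function and its branch labels
    are invented for the example.

```python
import fnmatch

def classify(arg: str) -> str:
    """Mimic a shell case statement: the first matching glob wins."""
    branches = [
        ("-h", "help"),
        ("--*", "long option"),
        ("-?", "short option"),
        ("*", "operand"),
    ]
    for pattern, label in branches:
        # fnmatchcase compares a string with a glob; no filesystem involved.
        if fnmatch.fnmatchcase(arg, pattern):
            return label
    return "no match"  # not reached while the catch-all "*" branch exists

for arg in ["-h", "--verbose", "-v", "file.txt"]:
    print(arg, "->", classify(arg))
```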

    You see uses of glob-like processing outside of Unix.
    For instance, the Redirector browser extension for Firefox
    uses glob-like patterns such as:

    https://example.com/*/foo/*.html

    where in the right hand side of the rewrite pattern you can refer
    to the parts matched by the * syntax, in left to right order
    using $1, $2, ...
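
    A rough Python sketch of that idea, not the extension's actual
    implementation: each * in the pattern becomes a capture group, and
    $1, $2, ... in the target are replaced by what the groups matched.
    The function name glob_rewrite and the example.org target are made up
    for illustration.

```python
import re

def glob_rewrite(include: str, target: str, url: str):
    """Rewrite `url` if it matches the glob `include`, else return None."""
    # Each * becomes a capture group; everything else is matched literally.
    regex = "(.*)".join(re.escape(part) for part in include.split("*"))
    m = re.fullmatch(regex, url)
    if m is None:
        return None
    # Replace $1, $2, ... with the text captured by the corresponding *.
    return re.sub(r"\$(\d+)", lambda ref: m.group(int(ref.group(1))), target)

result = glob_rewrite(
    "https://example.com/*/foo/*.html",
    "https://example.org/$2/$1",          # hypothetical rewrite target
    "https://example.com/docs/foo/index.html",
)
print(result)
```

    A reference such as $3 with only two * in the pattern would raise an
    IndexError in this sketch; a real tool would need to decide how to
    handle that.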

    > few additional keystrokes (.* instead of *) are IMHO easily compensated
    > for by the more powerful capabilities of regexes.

    So you might think, but it would actually be a nuisance, and
    trip up people. (Even the coders, never mind non-coders.)

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

  • From Lawrence D'Oliveiro@21:1/5 to Axel Reichert on Sun Jul 21 21:29:31 2024
    On Sun, 21 Jul 2024 09:01:18 +0200, Axel Reichert wrote:

    Since regexes are (at quick glance) a superset of globs, why not
    consistently use the former for both file names and strings?

    I would guess:

    1) Historical reasons; I would say wildcards originated before regexes.
    2) Issues with syntax: for example, dots mean “any character” in a regex,
    while they occur quite commonly in filenames, and having to escape them
    all the time would be a pain.
    3) Simplicity. Wildcards are much more limited than regexes, but they
    are convenient for a lot of common cases.

    Also, regexes have evolved a lot over time. They were considered fairly
    exotic at the time of the origin of tools like awk or sed, and also they
    varied quite a lot in syntax and capabilities. I would credit the coming
    of Perl with popularizing the idea in quite a powerful form--so powerful
    that “Perl-Compatible Regular Expression” (“PCRE”) has become the closest
    thing, I think, to a de facto standard for how regexes should work.

  • From Axel Reichert@21:1/5 to Ed Morton on Sat Aug 10 00:02:40 2024
    Ed Morton <mortonspam@gmail.com> writes:

    There's an IMO interesting Q&A about globs and regexps, including some history, at:

    https://unix.stackexchange.com/questions/136353/history-of-bash-globbing

    Many thanks, very interesting. It essentially expands a little on all
    the points brought forward here. There was a nice example that showed
    how painful regexes can be for a simple globbing task.

    Thanks also to the others who bothered to share their thoughts.

    Somehow I expected Plan 9's shell, rc, to have a more consistent
    approach, but it also features standard globbing patterns.

    Best regards

    Axel

  • From Lawrence D'Oliveiro@21:1/5 to Axel Reichert on Sat Aug 10 00:33:30 2024
    On Sat, 10 Aug 2024 00:02:40 +0200, Axel Reichert wrote:

    Somehow I expected Plan 9's shell, rc, to have a more consistent
    approach, but it also features standard globbing patterns.

    Plan 9 is one of those things you expect to have lots of new, clever stuff
    in it, but it doesn’t.
