• GNU Awk's types of regular expressions

    From Janis Papanagnou@21:1/5 to All on Thu Nov 28 19:18:29 2024
    GNU Awk currently has three types of regular expressions: in
    addition to the standard regexp constants (/regex/) and the dynamic
    regexps ("regex", or variables containing "regex"), newer versions
    also support first-class regexp objects (@/regex/, "Strongly Typed
    Regexp Constants").
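
A minimal sketch of the three forms side by side, assuming gawk 4.2 or later is on PATH (the @/.../ syntax is not available in older versions). The variable names and the sample string are made up for illustration; only the three regexp forms and typeof() come from the thread.

```shell
# Assumption: gawk >= 4.2 is installed; the awk program is fed in via a here-document.
out=$(gawk -f /dev/stdin <<'EOF'
BEGIN {
    s     = "don't panic"
    dyn   = "pan+ic"      # dynamic regexp: an ordinary string used as a pattern
    typed = @/pan+ic/     # strongly typed regexp constant

    if (s ~ /pan+ic/) print "constant: match"
    if (s ~ dyn)      print "dynamic: match"
    if (s ~ typed)    print "typed: match"

    # typeof() tells the plain string and the regexp-typed variable apart
    print typeof(dyn), typeof(typed)
}
EOF
)
printf '%s\n' "$out"
```

All three forms match the same text; the difference that matters for the rest of the thread is the type of the value held in the variable.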

    One principal advantage of regexp-constants is that the engine to
    parse the regexp can be created in advance, while a dynamic regexp
    may be constructed dynamically (from strings) and needs an explicit
    runtime-step to create the engine before the matching can be done.
    Now I assumed that @/regex-const/ would in that respect behave as
    /regex-const/ ... - until I found in the GNU Awk manual this text:

    |
    | Thus, if you have something like this:
    |
    | re = @/don't panic/
    | sub(/don't/, "do", re)
    | print typeof(re), re
    |
    | then re retains its type, but now attempts to match the string ‘do
    | panic’. This provides a (very indirect) way to create regexp-typed
    | variables at runtime.
    |

    (I'm astonished that first class regexp objects can be dynamically
    changed. But that is not my point here; I'm interested in potential
    pre-compiles of regexp constants...)

    This would imply that the first class regexp constants can be changed
    like dynamic regexps and that there's no regexp pre-compile involved.
    This would also raise suspicion that the "normal" regexp-constants are
    probably also not precomputed.

    So constant-regexps (both forms) have (only?) the advantage that the
    regexp-syntax can be (initially during awk parsing) checked, e.g.,

    re = @/don't panic[/
    ^ unterminated regexp

    And dynamic regexps and first class regexps that got changed (e.g.
    by code like

    sub(/don't/, "do[", re)

    in above sample snippet) would both create runtime errors, e.g.

    error: Unmatched [, [^, [:, [., or [=: /do[ panic/
    fatal: could not make typed regex

    (as all ill-formed regexp-types will produce a runtime error).

    Janis

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kaz Kylheku@21:1/5 to Janis Papanagnou on Fri Nov 29 04:13:43 2024
    On 2024-11-28, Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    In GNU Awk there's currently three types of regular expressions, in
    addition to the standard regexp-constants (/regex/) and the dynamic
    regexps ("regex", or variables containing "regex") there's in newer
    versions also first class regexp objects (@/regex/, "Strongly Typed
    Regexp Constants") supported.

    One principal advantage of regexp-constants is that the engine to
    parse the regexp can be created in advance, while a dynamic regexp
    may be constructed dynamically (from strings) and needs an explicit
    runtime-step to create the engine before the matching can be done.
    Now I assumed that @/regex-const/ would in that respect behave as
    /regex-const/ ... - until I found in the GNU Awk manual this text:

    |
    | Thus, if you have something like this:
    |
    | re = @/don't panic/
    | sub(/don't/, "do", re)
    | print typeof(re), re
    |
    | then re retains its type, but now attempts to match the string ‘do
    | panic’. This provides a (very indirect) way to create regexp-typed
    | variables at runtime.
    |

    (I'm astonished that first class regexp objects can be dynamically
    changed. But that is not my point here; I'm interested in potential
    pre-compiles of regexp constants...)

    I would flatly reject a commit to do such a thing. Yikes!

    What representation is it working on? If the regex contains
    a match for a literal backslash using escaping, does that
    count as two backslash characters when you operate on it?
    Or is it a single backslash? Can you replace the second
    backslash with an 'n' and have the pair turn into a newline?

    Is it just tromboning back to printed representation,
    and then parsing again?

    I provide this:

    1> (regex-source #/a.*b(c|d)/)
    (compound #\a (0+ wild) #\b (or #\c #\d))

    You can get the source code of the regex object as a nested
    list with symbols, characters and other objects.

    When you have this, you can analyze and transform it.

    Then you can call regex-compile on the result.

    For instance, prepend a match for the z character:

    2> (regex-compile ^(compound #\z ,*(cdr *1)))
    #/za.*b(c|d)/

    This is robust; you're not dealing with any character-syntax issues
    like escapes, because you have the abstract syntax tree of the regex.

    This would imply that the first class regexp constants can be changed
    like dynamic regexps and that there's no regexp pre-compile involved.

    Not necessarily; it could be that a new regex is compiled, and put into
    the re variable, clobbering the old regex, which is freed (if it
    hits a refcount of zero or whatever mem management is used).

    It could also (in combination with this) be lazy. So that is to say
    @/abc/ will just store the textual source code of the regex into
    the regex object, but not compile anything. When it comes time to
    use the regex, on first use, it is compiled and then cached into
    that object. When the regex is edited, the cache is invalidated.

    Someone will undoubtedly chime in confirming or refuting these
    hypotheses.

    It would be pretty silly if these regex objects didn't cache a compiled
    regex across multiple uses.

    And dynamic regexps and first class regexps that got changed (e.g.
    by code like

    sub(/don't/, "do[", re)

    in above sample snippet) would both create runtime errors, e.g.

    Have you tried this? Do you get an error at sub() time, or when
    you later try to use re?

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

  • From Janis Papanagnou@21:1/5 to Kaz Kylheku on Fri Nov 29 09:33:57 2024
    On 29.11.2024 05:13, Kaz Kylheku wrote:
    On 2024-11-28, Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    [...]

    And dynamic regexps and first class regexps that got changed (e.g.
    by code like

    sub(/don't/, "do[", re)

    in above sample snippet) would both create runtime errors, e.g.

    Have you tried this?

    Yes. (With a response that appeared in my post behind the "e.g." [that
    you snipped].)

    Do you get an error at sub() time, or when you later try to use re?

    It seems to appear with sub(); in the snippet
    ...
    print "PRE"
    sub(/don't/, "do[", re)
    print "POST"
    print typeof(re), re
    ...
    "PRE" is printed but not "POST".
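
For anyone who wants to repeat the probe, here is one way to make it a complete program. This is a sketch: it assumes the elided parts are just the assignment from the manual example quoted earlier in the thread, and that gawk 4.2 or later is on PATH. The exact error text and the point at which it appears may differ between gawk versions.

```shell
# Assumption: the "..." in the snippet stands for the manual's re = @/don't panic/.
# gawk is expected to exit non-zero here, hence the "|| true".
out=$(gawk -f /dev/stdin 2>&1 <<'EOF'
BEGIN {
    re = @/don't panic/
    print "PRE"
    sub(/don't/, "do[", re)    # the replacement makes the regexp text ill-formed
    print "POST"
    print typeof(re), re
}
EOF
) || true
printf '%s\n' "$out"
```

If the error is raised by sub() itself, as reported above, "PRE" shows up in the output and "POST" does not.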

    Janis

  • From Janis Papanagnou@21:1/5 to Kaz Kylheku on Sat Nov 30 12:41:52 2024
    Coming back to this...

    On 29.11.2024 05:13, Kaz Kylheku wrote:
    [...]

    It could also (in combination with this) be lazy. [...]

    Yes. There's already something like "on-demand logic" there, where
    in print > "a_file" the file won't be created or overwritten if
    the statement doesn't get triggered, and subsequent calls won't
    overwrite it. So it would indeed not be surprising if such a
    mechanism is implemented. (But I haven't examined the awk code.)


    Someone will undoubtedly chime in confirming or refuting these
    hypotheses.

    It would be pretty silly if these regex objects didn't cache a compiled
    regex across multiple uses.

    True. But, OTOH, in GNU Awk there's a couple functions that are
    just passed through to other (external) library functions. If these
    functions happen to support only an interface like match(re,str)
    where match() supports no [thread-safe] static memory for "re"
    the caller might have no choice. (Don't know how it's actually
    implemented.)

    Janis

    [...]

  • From Aharon Robbins@21:1/5 to Janis on Sun Dec 1 20:20:22 2024
    Hi. Mack The Knife pointed me at this question.

    This kind of query should go to the bug list (where I'll see it).
    I skim the help list occasionally but don't reply to mails there.

    In article <viac5m$l8oh$1@dont-email.me> Janis writes:
    In GNU Awk there's currently three types of regular expressions, in
    addition to the standard regexp-constants (/regex/) and the dynamic
    regexps ("regex", or variables containing "regex") there's in newer
    versions also first class regexp objects (@/regex/, "Strongly Typed
    Regexp Constants") supported.

    One principal advantage of regexp-constants is that the engine to
    parse the regexp can be created in advance, while a dynamic regexp
    may be constructed dynamically (from strings) and needs an explicit
    runtime-step to create the engine before the matching can be done.

    Even for such dynamically created regexps, the regexp is compiled
    once and cached, not compiled each time it's used (as long as it
    doesn't change).

    Now I assumed that @/regex-const/ would in that respect behave as
    /regex-const/ ... - until I found in the GNU Awk manual this text:

    | Thus, if you have something like this:
    |
    | re = @/don't panic/
    | sub(/don't/, "do", re)
    | print typeof(re), re
    |
    | then re retains its type, but now attempts to match the string ‘do
    | panic’. This provides a (very indirect) way to create regexp-typed
    | variables at runtime.

    (I'm astonished that first class regexp objects can be dynamically
    changed. But that is not my point here; I'm interested in potential
    pre-compiles of regexp constants...)

    Since `re' is a variable, it can be changed, just as when you do

    str = "don't panic"
    sub(/don't/, "do", str)

    This would imply that the first class regexp constants can be changed
    like dynamic regexps and that there's no regexp pre-compile involved.

    "Not so, Watson! Not so!" When you do

    re = @/don't panic/

    gawk uses reference counted pointers to the original object; the
    original strongly typed regexp is precompiled and remains that way.

    As soon as you go to *change* `re', gawk makes a copy of the string
    value of the original regexp, makes the substitution, notes that
    it's a strongly typed regexp, and compiles the new regexp. From then
    on, the cached compiled regexp is used for matching.
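
The visible side of this behaviour can be checked from an awk program, although the caching itself cannot be observed this way. A minimal sketch, assuming a gawk version that implements the manual text quoted above; the loop and the variable names are illustrative only.

```shell
# Assumption: gawk with strongly typed regexps (4.2 or later) is on PATH.
out=$(gawk -f /dev/stdin <<'EOF'
BEGIN {
    for (i = 1; i <= 2; i++) {
        re = @/don't panic/            # refers to the regexp constant
        before = typeof(re) " " re
        sub(/don't/, "do", re)         # re now holds a modified copy
        print i ": " before " -> " typeof(re) " " re
    }
}
EOF
)
printf '%s\n' "$out"
```

The second iteration starts from the unmodified text again, which is consistent with the substitution working on a copy rather than on the constant itself.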

    This would also raise suspicion that the "normal" regexp-constants are
    probably also not precomputed.

    Also not true.

    So constant-regexps (both forms) have (only?) the advantage that the
    regexp-syntax can be (initially during awk parsing) checked, e.g.,

    re = @/don't panic[/
    ^ unterminated regexp

    Incorrect, they are compiled when the program is parsed.

    And dynamic regexps and first class regexps that got changed (e.g.
    by code like

    sub(/don't/, "do[", re)

    in above sample snippet) would both create runtime errors, e.g.

    error: Unmatched [, [^, [:, [., or [=: /do[ panic/
    fatal: could not make typed regex

    (as all ill-formed regexp-types will produce a runtime error).

    Well, of course.

    In short, I jump through a lot of hoops in order to avoid recompiling
    regexps if it's not necessary.

    Hope this helps,

    Arnold
    --
    Aharon (Arnold) Robbins arnold AT skeeve DOT com

  • From Janis Papanagnou@21:1/5 to Aharon Robbins on Sun Dec 1 22:17:02 2024
    On 01.12.2024 21:20, Aharon Robbins wrote:
    Hi. Mack The Knife pointed me at this question.

    This kind of query should go to the bug list (where I'll see it).

    Oh, I didn't consider what I wrote and suspected to be a bug, so
    it didn't occur to me to use a bug mailing list.

    [ explanations snipped ]

    In short, I jump through a lot of hoops in order to avoid recompiling
    regexps if it's not necessary.

    Hope this helps,

    Yes. Thanks for shedding light on the internals. And glad to hear
    how it's actually implemented.

    Janis

  • From Aharon Robbins@21:1/5 to janis_papanagnou+ng@hotmail.com on Sun Dec 1 23:18:43 2024
    In article <viijof$2q4u6$1@dont-email.me>,
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    This kind of query should go to the bug list (where I'll see it).

    Oh, I didn't consider what I wrote and suspected to be a bug, so
    it didn't occur to me to use a bug mailing list.

    Legitimate questions like this about how gawk works internally,
    even if not bug reports, are welcome on the bug list. Sending them
    there makes it easy for me to respond to them.

    And of course, you can always look at the source code.

    Arnold
    --
    Aharon (Arnold) Robbins arnold AT skeeve DOT com

  • From Janis Papanagnou@21:1/5 to Aharon Robbins on Mon Dec 2 08:00:20 2024
    On 02.12.2024 00:18, Aharon Robbins wrote:

    And of course, you can always look at the source code.

    While I do that occasionally with some [better known] software
    packages I'm not familiar with the GNU Awk source code and it
    would IME require quite some analysis, how it's structured,
    what's going on, and in the end you are typically never quite
    sure whether it does what you think it does.

    This isn't meant as a statement of quality of software design
    or existence of useful comments in GNU Awk. It's just that the
    last time I looked into the sources (with the intention to add
    new syntax and semantics for a feature I'd have liked) I wasn't
    able to identify how to do it without doing harm to the code;
    I'm lacking the familiarity with this source code. Of course I
    could have looked into the source code instead of posting, but
    the described experience led me to not take that path.

    Re "(where I'll see it)": My post's intention was not meant to
    address/bother you personally - yet, all the more I appreciate
    your reply! In this newsgroup there's also some folks who have
    some expertise and might answer such questions. And I'm not a
    "client" of the mailing list. (Just to make you understand why
    I used this Usenet communication channel.) And finally, there
    was some discussion recently in another newsgroup about Regexps
    and I wanted to initiate a potential discussion on the topic.

    Janis

  • From Aharon Robbins@21:1/5 to janis_papanagnou+ng@hotmail.com on Mon Dec 2 20:58:39 2024
    In article <vijlu6$35un4$1@dont-email.me>,
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    This isn't meant as a statement of quality of software design
    or existence of useful comments in GNU Awk. It's just that the
    last time I looked into the sources (with the intention to add
    new syntax and semantics for a feature I'd have liked) I wasn't
    able to identify how to do it without doing harm to the code;
    I'm lacking the familiarity with this source code. Of course I
    could have looked into the source code instead of posting, but
    the described experience led me to not take that path.

    You can always ask me directly.

    Re "(where I'll see it)": My post's intention was not meant to
    address/bother you personally - yet, all the more I appreciate
    your reply! In this newsgroup there's also some folks who have
    some expertise and might answer such questions.

    True, but ultimately I'm authoritative. :-)

    And I'm not a "client" of the mailing list.

    You don't have to be subscribed to the bug list to send messages
    there.

    Arnold
    --
    Aharon (Arnold) Robbins arnold AT skeeve DOT com

  • From Janis Papanagnou@21:1/5 to Aharon Robbins on Mon Dec 2 23:13:31 2024
    On 02.12.2024 21:58, Aharon Robbins wrote:
    In article <vijlu6$35un4$1@dont-email.me>,
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    [...]

    You can always ask me directly.

    Thanks :-)

    (It's been so long since I wrote to you that I completely forgot
    about that possibility. Shame on me.)


    Re "(where I'll see it)": My post's intention was not meant to
    address/bother you personally - yet, all the more I appreciate
    your reply! In this newsgroup there's also some folks who have
    some expertise and might answer such questions.

    True, but ultimately I'm authoritative. :-)

    Undisputedly ;-)


    And I'm not a "client" of the mailing list.

    You don't have to be subscribed to the bug list to send messages
    there.

    Good to know.

    See you,
    Janis
