Forum: Too Lazy BBS

Re: Experiences with match() subexpressions?

From Janis Papanagnou@21:1/5 to Janis Papanagnou on Thu Apr 10 09:09:55 2025

On 10.04.2025 09:06, Janis Papanagnou wrote:

I'm looking for subexpressions of regexp-matches using GNU Awk's
third parameter of match(). For example

data = "R=r1,R=r2,R=r3,E=e"
match (data, /^(R=([^,]+),){2,5}E=(.+)$/, arr)

The result stored in 'arr' seems to be determined by the static
parenthesis structure, so with the pattern repetition {2,5} only
the last matched data in the subexpression (r3) seems to persist
in arr. - I suppose there's no cute way to achieve what I wanted?

To clarify; what I wanted is access of the values "r1", "r2", "r3",
and "e" through 'arr'.

Janis

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Janis Papanagnou@21:1/5 to All on Thu Apr 10 09:06:34 2025

I'm looking for subexpressions of regexp-matches using GNU Awk's
third parameter of match(). For example

data = "R=r1,R=r2,R=r3,E=e"
match (data, /^(R=([^,]+),){2,5}E=(.+)$/, arr)

The result stored in 'arr' seems to be determined by the static
parenthesis structure, so with the pattern repetition {2,5} only
the last matched data in the subexpression (r3) seems to persist
in arr. - I suppose there's no cute way to achieve what I wanted?

Janis


From Kenny McCormack@21:1/5 to janis_papanagnou+ng@hotmail.com on Thu Apr 10 11:08:55 2025

In article <vt7qs4$2gior$1@dont-email.me>,
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

On 10.04.2025 09:06, Janis Papanagnou wrote:

I'm looking for subexpressions of regexp-matches using GNU Awk's
third parameter of match(). For example

data = "R=r1,R=r2,R=r3,E=e"
match (data, /^(R=([^,]+),){2,5}E=(.+)$/, arr)

The result stored in 'arr' seems to be determined by the static
parenthesis structure, so with the pattern repetition {2,5} only
the last matched data in the subexpression (r3) seems to persist
in arr. - I suppose there's no cute way to achieve what I wanted?

To clarify; what I wanted is access of the values "r1", "r2", "r3",
and "e" through 'arr'.

I have to admit that I (still) don't really understand how this match third
arg stuff works. I.e., I can never predict what will happen, so I always
just dump out the array and try to reverse-engineer it each time I need to
use it.

I adapted your code into the following test script:

--- Cut Here ---
#!/bin/sh
gawk 'BEGIN {
data = "R=r1,R=r2,R=r3,E=e"
match (data, /^(R=([^,]+),){2,5}E=(.+)$/, arr)
for (i in arr) print i,arr[i]
}'

# To clarify; what I wanted is access of the values "r1", "r2", "r3",
# and "e" through 'arr'.
--- Cut Here ---

The output I get is:

--- Cut Here ---
0⌙start 1
0⌙length 18
3⌙start 18
1⌙start 11
2⌙start 13
3⌙length 1
2⌙length 2
1⌙length 5
0 R=r1,R=r2,R=r3,E=e
1 R=r3,
2 r3
3 e
--- Cut Here ---

After playing around a bit, I could not come up with any sensible way of getting what you want to get.

As an alternative, it sounds like you could just split the
string on the comma; that would get you:

R=r1
R=r2
R=r3
E=e

Or, for finer control, you could use patsplit().
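
For reference, a minimal sketch of the patsplit() route, assuming gawk is installed (patsplit() is a GNU extension). The shell function name pair_values is made up for this illustration; the pattern picks out each "=value" piece and the leading "=" is then dropped.

```shell
# pair_values is a hypothetical helper name; it needs gawk for patsplit().
pair_values() {
  printf '%s\n' "$1" | gawk '{
    # Each piece is "=" followed by everything up to the next comma.
    n = patsplit($0, piece, /=[^,]+/)
    for (i = 1; i <= n; i++)
      print substr(piece[i], 2)   # drop the leading "="
  }'
}

# Usage: pair_values "R=r1,R=r2,R=r3,E=e"
```

Called with the sample data, this should print r1, r2, r3 and e on separate lines. It does not check that the keys are R or E; that would still need a second look at each piece.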

--
The randomly chosen signature file that would have appeared here is more than 4 lines long. As such, it violates one or more Usenet RFCs. In order to remain in compliance with said RFCs, the actual sig can be f

From Janis Papanagnou@21:1/5 to Kenny McCormack on Thu Apr 10 13:55:07 2025

On 10.04.2025 13:08, Kenny McCormack wrote:

In article <vt7qs4$2gior$1@dont-email.me>,
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

On 10.04.2025 09:06, Janis Papanagnou wrote:

I'm looking for subexpressions of regexp-matches using GNU Awk's
third parameter of match(). For example

data = "R=r1,R=r2,R=r3,E=e"
match (data, /^(R=([^,]+),){2,5}E=(.+)$/, arr)

The result stored in 'arr' seems to be determined by the static
parenthesis structure, so with the pattern repetition {2,5} only
the last matched data in the subexpression (r3) seems to persist
in arr. - I suppose there's no cute way to achieve what I wanted?

To clarify; what I wanted is access of the values "r1", "r2", "r3",
and "e" through 'arr'.

I have to admit that I (still) don't really understand how this match third arg stuff works.

I've never used that before but it seems to be quite simple; for every parenthesis group expression in the regexp it provides (statically, as
the parentheses are written, from left to right) an array element with
the expanded matched subexpression.

I.e., I can never predict what will happen, so I always
just dump out the array and try to reverse-engineer it each time I need to use it.

I adapted your code into the following test script:

--- Cut Here ---
#!/bin/sh
gawk 'BEGIN {
data = "R=r1,R=r2,R=r3,E=e"
match (data, /^(R=([^,]+),){2,5}E=(.+)$/, arr)
for (i in arr) print i,arr[i]
}'

# To clarify; what I wanted is access of the values "r1", "r2", "r3",
# and "e" through 'arr'.
--- Cut Here ---

The output I get is:

--- Cut Here ---
0⌙start 1
0⌙length 18
3⌙start 18
1⌙start 11
2⌙start 13
3⌙length 1
2⌙length 2
1⌙length 5

The output above appears because 'arr' also stores additional elements
holding the match positions.

I don't need those, so I'm only interested in the data elements below and
iterate over them with an index-counted loop...

0 R=r1,R=r2,R=r3,E=e

the whole expression

1 R=r3,

the expression in the first parenthesis

2 r3

the expression in the second, embedded parenthesis

3 e

the expression in the final parenthesis

--- Cut Here ---

After playing around a bit, I could not come up with any sensible way of getting what you want to get.

Yeah, Arnold just told me the same; that it's impossible because the
underlying GNU regexp library doesn't support what I'm looking for.

What I considered a possible workaround (in this case) is to sequence
the (...){2,5} expression by using sequences of (...)? expressions.
(But in the general case, for larger ranges than 2-5, that's neither
feasible nor sensible any more.)
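
A minimal sketch of that unrolling for the {2,5} case, assuming gawk. The helper name r_and_e and the group bookkeeping are assumptions of this sketch, not something from the gawk documentation; it only shows why the approach stops scaling for larger ranges.

```shell
# r_and_e is a hypothetical helper name; it needs gawk for the third match() argument.
r_and_e() {
  printf '%s\n' "$1" | gawk '{
    # {2,5} unrolled: two mandatory R items, then three optional ones.
    # Groups 1 and 2 are the mandatory values, 4, 6 and 8 the optional
    # values (3, 5 and 7 are their wrappers), and 9 is the E value.
    re = "^R=([^,]+),R=([^,]+),(R=([^,]+),)?(R=([^,]+),)?(R=([^,]+),)?E=(.+)$"
    if (match($0, re, arr)) {
      print arr[1]
      print arr[2]
      for (k = 4; k <= 8; k += 2)
        if (arr[k] != "") print arr[k]   # optional groups that did not match hold no text
      print arr[9]
    }
  }'
}

# Usage: r_and_e "R=r1,R=r2,R=r3,E=e"
```

Every additional repetition costs two more groups and another index to keep track of, which is the reason the unrolled form is only tolerable for small ranges.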

As an alternative, it sounds like you could just split the
string on the comma; that would get you:

Yes, that was also how I did such things in the past. Only when I saw
that "third argument" to match() I hoped the two-level parsing could
be simplified into one step. The reason was that I thought I had seen
other languages (Perl, maybe?) that supported such a feature.

R=r1
R=r2
R=r3
E=e

Or, for finer control, you could use patsplit().

I think I'll do the parsing the straightforward two-step way as I did
before the GNU Awk specific functions were available; it's probably
also the clearest way to program that functionality.
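
A minimal sketch of the two-step way in plain POSIX awk (no GNU extensions), assuming the comma-separated key=value layout of the sample data. The helper name parse_pairs and the "key value" output format are made up for this illustration.

```shell
# parse_pairs is a hypothetical helper name; it prints one "key value" pair per line.
parse_pairs() {
  printf '%s\n' "$1" | awk '{
    n = split($0, item, ",")          # step 1: split the record on commas
    for (i = 1; i <= n; i++) {
      eq = index(item[i], "=")        # step 2: cut each item at its first "="
      if (eq == 0) continue           # skip items that have no "="
      print substr(item[i], 1, eq - 1), substr(item[i], eq + 1)
    }
  }'
}

parse_pairs 'R=r1,R=r2,R=r3,E=e'
```

For the sample data this prints "R r1", "R r2", "R r3" and "E e", one pair per line. Checking the 2..5 repetition count, if it matters, would be a separate test on the number of R items.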

Janis


From Kenny McCormack@21:1/5 to janis_papanagnou+ng@hotmail.com on Thu Apr 10 14:04:46 2025

In article <vt8bit$2uiq5$1@dont-email.me>,
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
...

I have to admit that I (still) don't really understand how this match third arg stuff works.

...

I.e., I can never predict what will happen, so I always
just dump out the array and try to reverse-engineer it each time I need to use it.

...

The output above appears because 'arr' also stores additional elements
holding the match positions.

Just to clarify, I wasn't looking for a tutorial (man page regurgitation).
I understand the man page description of match's 3rd arg as well as anyone;
I just find that it doesn't do as much in practice as (I think) it
should - and that it is unpredictable (by me, anyway) what it will do (you
have to dump out the array and trial-and-error it to get it to do what you want). It promises more than it delivers. I have much the same comments
to make about the similar functionality in Tcl (Expect).

None of which is criticism of the feature; as you say below, it basically
does as much as the underlying regexp library allows it to do.

...

I think I'll do the parsing the straightforward two-step way as I did
before the GNU Awk specific functions were available; it's probably
also the clearest way to program that functionality.

Probably so. BTW, it is not really "GNU Awk specific"; lots of languages
have this general capability.

Incidentally, here is a function of mine that uses match's 3rd arg. I find
it useful. This addresses a common AWK issue, where you have a line with fields (in the usual AWK whitespace-delimited sense), but you need to know
the actual character positions of the fields (since they can move around
from line to line of input). Note also that I'm not really sure where the
name "splitMatch" came from; it was just what popped into my head when I
was writing this...

--- Cut Here ---
# Find the character positions of each of the fields in string s.
# Note that s will usually be $0, and n will usually be NF.
function splitMatch(s,n,A, i,t) {
for (i=1; i<=n; i++) t = t "([^ \t]+)[ \t]*"
return match(s,t,A)
}
--- Cut Here ---
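
A short usage sketch for splitMatch(), assuming gawk (the third argument of match() is a GNU extension). The wrapper name field_positions and the output layout are made up for this illustration; the function body is the one above.

```shell
# field_positions is a hypothetical wrapper; it needs gawk.
field_positions() {
  printf '%s\n' "$1" | gawk '
    # The splitMatch function from the post above.
    function splitMatch(s, n, A,    i, t) {
      for (i = 1; i <= n; i++) t = t "([^ \t]+)[ \t]*"
      return match(s, t, A)
    }
    {
      splitMatch($0, NF, pos)
      # For each field: field number, start position, length, text.
      for (i = 1; i <= NF; i++)
        print i, pos[i, "start"], pos[i, "length"], pos[i]
    }'
}

# Usage: field_positions "ab   cde f"
```

For the input "ab   cde f" this should report the fields starting at characters 1, 6 and 10.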

--
In the corner of the room on the ceiling is a large vampire bat who
is obviously deranged and holding his nose.


From Janis Papanagnou@21:1/5 to Kenny McCormack on Thu Apr 10 23:39:57 2025

On 10.04.2025 16:04, Kenny McCormack wrote:

In article <vt8bit$2uiq5$1@dont-email.me>,
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
...

I have to admit that I (still) don't really understand how this match third arg stuff works.

...

I.e., I can never predict what will happen, so I always
just dump out the array and try to reverse-engineer it each time I need to use it.

...

The output above appears because 'arr' also stores additional elements
holding the match positions.

Just to clarify, I wasn't looking for a tutorial (man page regurgitation).
I understand the man page description of match's 3rd arg as well as anyone;

(I didn't mean to offend you. Sorry, if it appeared so. - I just
read you writing "I don't really understand how this [...] works",
and that "it is unpredictable", so I thought some descriptive words
may be useful.)

I just find that it doesn't do as much in practice as (I think) it
should - and that it is unpredictable (by me, anyway) what it will do (you have to dump out the array and trial-and-error it to get it to do what you want).

It is pretty understandable to me, and not the least unpredictable.
(That's why I thought it would be okay to write what I had written
to explain it.) I don't understand what you find to be unpredictable.
But never mind.

It promises more than it delivers.

Yes, probably. Although, according to what's literally documented,
it doesn't promise too much, IMO. The feature can be very useful,
but not for the case I was looking for. - Actually, it could have
provided the functionality I was seeking, but since GNU Awk relies
on the GNU regexp functions as they are implemented I cannot expect
that any provided feature gets extended by Awk. - If GNU Awk had
its own RE implementation then we could think about using, e.g.,
another array dimension to store the (now only temporarily existing,
and generally unavailable) subexpressions.

[...]

None of which is criticism of the feature; as you say below, it basically does as much as the underlying regexp library allows it to do.

...

I think I'll do the parsing the straightforward two-step way as I did
before the GNU Awk specific functions were available; it's probably
also the clearest way to program that functionality.

Probably so. BTW, it is not really "GNU Awk specific"; lots of languages have this general capability.

Oh, I was just trying to say that for my programming the standard Awk
functions (as opposed to GNU Awk _specific_ functions) are fine here.
(That is not meant to disparage the many useful GNU Awk extensions.)

Janis

[...]


From Janis Papanagnou@21:1/5 to Aharon Robbins on Fri Apr 11 09:10:55 2025

On 11.04.2025 08:33, Aharon Robbins wrote:

In article <vt9dre$3t3po$1@dont-email.me>,
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

The feature can be very useful,
but not for the case I was looking for. - Actually, it could have
provided the functionality I was seeking, but since GNU Awk relies
on the GNU regexp functions as they are implemented I cannot expect
that any provided feature gets extended by Awk. - If GNU Awk had
its own RE implementation then we could think about using, e.g.,
another array dimension to store the (now only temporarily existing,
and generally unavailable) subexpressions.

Actually, this is not so trivial. The data structures at the C level
as mandated by POSIX are one dimensional; the submatches in parentheses
are counted from left to right. There's no way to represent the subexpressions that are under control of interval expressions, which
would essentially require a two-dimensional data structure.

Yes, that's why I had thought about a 2-dimensional array [on GNU
Awk level] so that arr[n][i] for i=1..z would contain the patterns.
This is what I actually tried with GNU Awk (before I had asked you)
to see whether there's some undocumented feature.

(I'm aware that things may get quite complicated if there are
restrictions imposed (on "C"-level or elsewhere) that are in the way.)

Mike Haertel is writing a new regexp matcher for gawk; it was announced
here some time ago: https://github.com/mikehaertel/minrx. The code is
in the feature/minrx branch of the gawk Git repository.

I just opened an issue, https://github.com/mikehaertel/minrx/issues/43,
about this question. We shall see what develops.

Oh, thanks for that. - My expectation had just been to check whether
such a feature is already available in GNU Awk (or could in a simple
way be made available with little effort). So I'm indeed interested
to hear whether that is a feasible and sensible feature. - Myself,
I have to admit, I haven't yet thoroughly thought through such a
feature. I've just seen it from the limited view of my application
context and thought it could be a worthwhile extension/generalization.

Janis

Arnold


From Aharon Robbins@21:1/5 to janis_papanagnou+ng@hotmail.com on Fri Apr 11 06:33:19 2025

In article <vt9dre$3t3po$1@dont-email.me>,
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

The feature can be very useful,
but not for the case I was looking for. - Actually, it could have
provided the functionality I was seeking, but since GNU Awk relies
on the GNU regexp functions as they are implemented I cannot expect
that any provided feature gets extended by Awk. - If GNU Awk had
its own RE implementation then we could think about using, e.g.,
another array dimension to store the (now only temporarily existing,
and generally unavailable) subexpressions.

Actually, this is not so trivial. The data structures at the C level
as mandated by POSIX are one dimensional; the submatches in parentheses
are counted from left to right. There's no way to represent the
subexpressions that are under control of interval expressions, which
would essentially require a two-dimensional data structure.

Mike Haertel is writing a new regexp matcher for gawk; it was announced
here some time ago: https://github.com/mikehaertel/minrx. The code is
in the feature/minrx branch of the gawk Git repository.

I just opened an issue, https://github.com/mikehaertel/minrx/issues/43,
about this question. We shall see what develops.

Arnold


From Kaz Kylheku@21:1/5 to Aharon Robbins on Fri Apr 11 07:40:01 2025

On 2025-04-11, Aharon Robbins <arnold@freefriends.org> wrote:

In article <vt9dre$3t3po$1@dont-email.me>,
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

The feature can be very useful,
but not for the case I was looking for. - Actually, it could have
provided the functionality I was seeking, but since GNU Awk relies
on the GNU regexp functions as they are implemented I cannot expect
that any provided feature gets extended by Awk. - If GNU Awk had
its own RE implementation then we could think about using, e.g.,
another array dimension to store the (now only temporarily existing,
and generally unavailable) subexpressions.

Actually, this is not so trivial. The data structures at the C level
as mandated by POSIX are one dimensional; the submatches in parentheses
are counted from left to right. There's no way to represent the subexpressions that are under control of interval expressions, which
would essentially require a two-dimensional data structure.

Mike Haertel is writing a new regexp matcher for gawk; it was announced
here some time ago: https://github.com/mikehaertel/minrx. The code is
in the feature/minrx branch of the gawk Git repository.

I just opened an issue, https://github.com/mikehaertel/minrx/issues/43,
about this question. We shall see what develops.

Unix and POSIX regular expressions have perpetrated a kind of
misfeature. They took the purely algebraic parentheses described in
classic literature on regular expressions, whose only role is to
override the precedence and associativity of operators, and turned them
into active operators that perform a double duty: they still override precedence, but also denote submatches associated with capture
registers.

Parentheses are enumerated and made to correspond with numbered capture registers, I think, as follows:

( ( ) ( ( ) ) )
1 2   3 4

Scanning left to right, we identify the open left parentheses
which have matching closing parentheses, and number these in order
starting from 1.

There is a convention that capture register 0 is reserved for
the full match for the expression. This is how it is with
the array reported by POSIX's regexec. Thus the numbering is
one based.

The POSIX standard clearly says what happens when a parenthesized
subexpression matches something more than once.

This is spelled out in the documentation page on the regcomp,
regexec and regfree functions. Look for this text:

"If subexpression i in a regular expression is not contained within
another subexpression, and it participated in the match several times,
then the byte offsets in pmatch[i] shall delimit the last such match."

This is exactly the last match behavior observed by Janis in Awk's
match function.

Basically, subexpressions are a dumb hack. As the regex automaton
traverses through its states in response to the input, it triggers
some anchor points associated with the original subexpression,
which copy some data, or keep track of some pointers to the start and
end of the match. When the submatch is complete, there is a data
transfer which clobbers any previous such data transfer.

There are some tricky rules for nested expressions.
Suppose that we have:

( ... ( ... ) ...)
1     2

2 is nested inside 1. Suppose that 1 matches multiple times.
Clearly, the corresponding register is left with the most
recent match when the matching is done.

But suppose that subexpression 2 sometimes matches when 1
matches, but sometimes does not match when 1 matches.

I think the obscurely worded POSIX rules are trying to prevent an inconsistency.

In a nutshell, if a string is reported in register 2 from
matching subexpression 2, it has to be a substring of a match that is concurrently happening for subexpression 1.

Now suppose that an iteration of 1 matches something,
but in that iteration, subexpression 2 does not match.
Then 2 has to be reset to indicate that it didn't match anything.

Probably, it's a good idea to implement the behavior as follows: whenever a
new capture iteration begins for 1, the register for 2 must also be
cleared, so that it doesn't retain stale data in the event that a match
for 2 is not encountered in the new iteration of 1.

This stuff is not really that usable for repetition; captures
were clearly envisioned mainly for non-repeating matching without
any Kleene stars or {m, n} repetitions.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca


From Kenny McCormack@21:1/5 to Aharon Robbins on Fri Apr 11 08:57:22 2025

In article <67f8b7af$0$705$14726298@news.sunsite.dk>,
Aharon Robbins <arnold@freefriends.org> wrote:
...

Mike Haertel is writing a new regexp matcher for gawk; it was announced
here some time ago: https://github.com/mikehaertel/minrx. The code is
in the feature/minrx branch of the gawk Git repository.

Just out of curiosity, does the new matcher address the issue raised by
Janis?

It sounds like you are implying that it does, but do not say so explicitly.

Again, just curiosity. I remember when you announced the new matcher, and
it sounded interesting, but the presentation left me wondering the usual question(s): Why should I care? (Why should I get excited about this?)

Incidentally, I remember that the primary issue with the new matcher was
that it was written in C++. It needs to be C-ified in order to be included
in a GAWK release version.

--
If the automobile had followed the same development cycle as the
computer, a Rolls-Royce today would cost $100, get a million miles to
the gallon, and explode once every few weeks, killing everyone inside.


From Kaz Kylheku@21:1/5 to Janis Papanagnou on Fri Apr 11 08:22:44 2025

On 2025-04-11, Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

On 11.04.2025 08:33, Aharon Robbins wrote:

In article <vt9dre$3t3po$1@dont-email.me>,
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

The feature can be very useful,
but not for the case I was looking for. - Actually, it could have
provided the functionality I was seeking, but since GNU Awk relies
on the GNU regexp functions as they are implemented I cannot expect
that any provided feature gets extended by Awk. - If GNU Awk had
its own RE implementation then we could think about using, e.g.,
another array dimension to store the (now only temporarily existing,
and generally unavailable) subexpressions.

Actually, this is not so trivial. The data structures at the C level
as mandated by POSIX are one dimensional; the submatches in parentheses
are counted from left to right. There's no way to represent the
subexpressions that are under control of interval expressions, which
would essentially require a two-dimensional data structure.

Yes, that's why I had thought about a 2-dimensional array [on GNU
Awk level] so that arr[n][i] for i=1..z would contain the patterns.
This is what I actually tried with GNU Awk (before I had asked you)
to see whether there's some undocumented feature.

I solved this problem 15 years ago in the TXR Pattern Language

$ echo 'R=r1,R=r2,R=r3,E=e' | txr -B -c '@(coll)R=@r,@(until)E@(end)E=@e'
r[0]="r1"
r[1]="r2"
r[2]="r3"
e="e"

We can eval the output into Bash and have a ${r[@]} array.

We can see the captured variables in a Lisp format:

$ echo 'R=r1,R=r2,R=r3,E=e' | txr -l -c '@(coll)R=@r,@(until)E@(end)E=@e'
(r "r1" "r2" "r3")
(e . "e")

The matches occurring in repetition constructs like @(coll) or its
vertical, line-oriented counterpart @(collect), are automatically
tabulated into lists.

We can see that the "e" variable wasn't; it is string valued,
rather than list valued.

One possibility is to use the @(merge dest {sources}*) directive which
examines different nesting depths of its operands and
intelligently combines them.

$ echo 'R=r1,R=r2,R=r3,E=e' | txr -B -c '@(coll)R=@r,@(until)E@(end)E=@e @(merge x r e)'
r[0]="r1"
r[1]="r2"
r[2]="r3"
e="e"
x[0]="r1"
x[1]="r2"
x[2]="r3"
x[3]="e"

$ echo 'R=r1,R=r2,R=r3,E=e' | txr -B -c '@(coll)R=@r,@(until)E@(end)E=@e @(merge x r e)
@(forget r e)'
x[0]="r1"
x[1]="r2"
x[2]="r3"
x[3]="e"

A plethora of techniques are possible.

In Lisp, split the data along commas, then again on =:

(flow "R=r1,R=r2,R=r3,E=e"
  (spl ","))
("R=r1" "R=r2" "R=r3" "E=e")

(flow "R=r1,R=r2,R=r3,E=e"
  (spl ",")
  (map (op spl "=")))
(("R" "r1") ("R" "r2") ("R" "r3") ("E" "e"))

Or pattern match the comma splits:

(flow "R=r1,R=r2,R=r3,E=e"
  (spl ",")
  (map (do match `@key=@val` @1 (list key val))))
(("R" "r1") ("R" "r2") ("R" "r3") ("E" "e"))

Just the R's please:

(flow "R=r1,R=r2,R=r3,E=e"
  (spl ",")
  (map (do if-match `R=@val` @1 val)))
("r1" "r2" "r3" nil)

Splice out the nils:

(flow "R=r1,R=r2,R=r3,E=e"
  (spl ",")
  (mappend (do if-match `R=@val` @1 (list val))))
("r1" "r2" "r3")

Or remove them:

(flow "R=r1,R=r2,R=r3,E=e"
  (spl ",")
  (map (do if-match `R=@val` @1 val))
  (remq nil))

Heck, use a Lispified Awk. The variable f holds
the fields. When we assign f to itself, that
forces the recalculation of variable rec with
the ofs:

(awk (:inputs '("R=r1,R=r2,R=r3,E=e"))
  (:set fs "," ofs ":")
  (t (set f f) (prn)))
R=r1:R=r2:R=r3:E=e
nil

Use two Awks, nested inside each other: inner Awk
processes the fields f produced by the outer Awk:

(awk (:inputs '("R=r1,R=r2,R=r3,E=e"))
  (:set fs "," ofs ":")
  (t (awk (:inputs f)
       (:set fs "=")
       (t (prn [f 1])))))
r1
r2
r3
e
nil

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca


From Janis Papanagnou@21:1/5 to Kenny McCormack on Fri Apr 11 15:50:11 2025

On 11.04.2025 10:57, Kenny McCormack wrote:

In article <67f8b7af$0$705$14726298@news.sunsite.dk>,
Aharon Robbins <arnold@freefriends.org> wrote:
...

Mike Haertel is writing a new regexp matcher for gawk; it was announced
here some time ago: https://github.com/mikehaertel/minrx. The code is
in the feature/minrx branch of the gawk Git repository.

Just out of curiosity, does the new matcher address the issue raised by Janis?

I read his post as if he put it under discussion ("I just opened an
issue, [...] about this question. We shall see what develops.") and
the provided link shows this as well.[*]

(I don't see the answers, though, since my browser obviously doesn't
support the web-page's (dynamic?) format. - So I cannot tell what the
state of that discussion is.)

It sounds like you are implying that it does, but do not say so explicitly.

[...]

Janis

[*] From https://github.com/mikehaertel/minrx/issues/43:

So there are two questions.

Is it theoretically possible to capture all the instances of
subexpressions matched by the interval expression?

Can this be brought out into the code? I understand it would take an extended API with a richer data structure in order to do this. gawk's
extended version of the match() function could then be (somehow)
extended to take advantage of this feature.


From Kaz Kylheku@21:1/5 to Aharon Robbins on Fri Apr 11 17:54:07 2025

On 2025-04-11, Aharon Robbins <arnold@freefriends.org> wrote:

In article <vt9dre$3t3po$1@dont-email.me>,
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

The feature can be very useful,
but not for the case I was looking for. - Actually, it could have
provided the functionality I was seeking, but since GNU Awk relies
on the GNU regexp functions as they are implemented I cannot expect
that any provided feature gets extended by Awk. - If GNU Awk had
its own RE implementation then we could think about using, e.g.,
another array dimension to store the (now only temporarily existing,
and generally unavailable) subexpressions.

Actually, this is not so trivial. The data structures at the C level
as mandated by POSIX are one dimensional; the submatches in parentheses
are counted from left to right. There's no way to represent the subexpressions that are under control of interval expressions, which
would essentially require a two-dimensional data structure.

Here is what I believe is the right requirement, if you want repeatedly
visited subexpressions to capture all their iterations.

The dimensionality has to be such that the entire array of matches is
versioned as a whole.

In other words, abstractly, we have

matches[history][register]

where history counts from 0, that being the latest matches.
register also goes from zero; [0] is the match for the entire
expression, [1] for subexpression 1 and so on.

Any time there is a repetition in any subexpression, matches[0]
is duplicated and pushed into the history.

We can imagine the matches[h][0..(n-1)] giving a trace of the
matches through the tree of subexpressions, from root to leaf.
Each time something is matched, the entire trace is recorded
in the history, so everything is consistent.

Say we want to parse the syntax

key=v1,v2,v3 foo=a,b

Using something like :

([^ =]+=([^ ,]*,?)* *)*
1       2

Then we have the subgroups 1 and 2. We would like to end up with
a two dimensional match array like this:

match[hist][reg] =

                    reg

hist  0                     1             2

0     key=v1,v2,v3 foo=a,b  foo=a,b       b
1     key=v1,v2,v3 foo=a,b  foo=a,b       a,
2     key=v1,v2,v3 foo=a,b  key=v1,v2,v3  v3
3     key=v1,v2,v3 foo=a,b  key=v1,v2,v3  v2,
4     key=v1,v2,v3 foo=a,b  key=v1,v2,v3  v1,

This gives us the raw trace snapshot data from which a tree could be
built using a simple algorithm (say, still in the order of leftmost
being more recent match):

            "key=v1,v2,v3 foo=a,b"
               /              \
        "foo=a,b"          "key=v1,v2,v3"
         /    \             /    |     \
       "b"   "a,"        "v3"  "v2,"  "v1,"

This structure provides more logical access.

Anyway, I feel this problem is better solved using approaches
that avoid regexes, or that use regexes for just some low-level
tokenizing.

With my above regex, there are stray commas in the items,
because they had to be included in the repetition, and there
is no nice way to exclude them without adding another level
of parentheses.

Each time we play with the parentheses, we radically change
the structure and size of the output.

It just ends up a wrongheaded academic exercise.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca


From Kenny McCormack@21:1/5 to mcollado2011@gmail.com on Fri Apr 18 12:01:18 2025

In article <vtt813$2ovai$1@dont-email.me>,
Manuel Collado <mcollado2011@gmail.com> wrote:
...

A 2-dimensional array is not strictly necessary. It could be possible to
keep the one dimensional array interface and use the same trick as for
multidimensional array indices in POSIX AWK. I.e., return a list of
matched values delimited by SUBSEP.

But why would you want to?

GAWK has multidimensional arrays; they should be used.
--
(Cruz certainly has an odd face) ... it looks like someone sewed pieces of a waterlogged Reagan mask together at gunpoint ...

http://www.rollingstone.com/politics/news/how-america-made-donald-trump-unstoppable-20160224


From Janis Papanagnou@21:1/5 to Manuel Collado on Fri Apr 18 14:24:24 2025

On 18.04.2025 12:03, Manuel Collado wrote:

On 11/4/25 at 9:10, Janis Papanagnou wrote:

On 11.04.2025 08:33, Aharon Robbins wrote:

In article <vt9dre$3t3po$1@dont-email.me>,
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

The feature can be very useful,
but not for the case I was looking for. - Actually, it could have
provided the functionality I was seeking, but since GNU Awk relies
on the GNU regexp functions as they are implemented I cannot expect
that any provided feature gets extended by Awk. - If GNU Awk had
its own RE implementation then we could think about using, e.g.,
another array dimension to store the (now only temporarily existing,
and generally unavailable) subexpressions.

Actually, this is not so trivial. The data structures at the C level
as mandated by POSIX are one dimensional; the submatches in parentheses
are counted from left to right. There's no way to represent the
subexpressions that are under control of interval expressions, which
would essentially require a two-dimensional data structure.

Yes, that's why I had thought about a 2-dimensional array [on GNU
Awk level] so that arr[n][i] for i=1..z would contain the patterns.
This is what I actually tried with GNU Awk (before I had asked you)
to see whether there's some undocumented feature.

A 2-dimensional array is not strictly necessary. It could be possible to
keep the one dimensional array interface and use the same trick as for
multidimensional array indices in POSIX AWK. I.e., return a list of
matched values delimited by SUBSEP.

Yes, of course. - I suggested a 2-dimensional array only because
it's IMO simpler to process and access. And since the potential new
functionality would be non-standard anyway, nothing would prevent
the match() function (with the new logic) from using GNU Awk's
non-standard 2-dimensional arrays.
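
For illustration only: neither form is something match() returns today. A minimal sketch in plain POSIX awk of how a SUBSEP-delimited list of repeated values could be built and unpacked at the user level; the helper name collect_r and the scanning loop are assumptions of this sketch.

```shell
# collect_r is a hypothetical helper name; plain POSIX awk is enough here.
collect_r() {
  printf '%s\n' "$1" | awk '{
    list = ""; rest = $0
    # Consume leading "R=value," items one at a time.
    while (match(rest, /^R=[^,]+,/)) {
      val = substr(rest, 3, RLENGTH - 3)          # text between "R=" and ","
      list = (list == "" ? val : list SUBSEP val) # one string, SUBSEP-delimited
      rest = substr(rest, RLENGTH + 1)
    }
    n = split(list, vals, SUBSEP)                 # unpack the list again
    for (i = 1; i <= n; i++) print i, vals[i]
  }'
}

# Usage: collect_r "R=r1,R=r2,R=r3,E=e"
```

With the sample data the list unpacks to three numbered values, r1 to r3. In gawk the same loop could store into vals[2][i] instead, which is the access pattern the 2-dimensional suggestion has in mind.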

Janis

