I'm looking for subexpressions of regexp-matches using GNU Awk's
third parameter of match(). For example
data = "R=r1,R=r2,R=r3,E=e"
match (data, /^(R=([^,]+),){2,5}E=(.+)$/, arr)
The result stored in 'arr' seems to be determined by the static
parenthesis structure, so with the pattern repetition {2,5} only
the last matched data in the subexpression (r3) seems to persist
in arr. - I suppose there's no cute way to achieve what I wanted?
Janis
On 10.04.2025 09:06, Janis Papanagnou wrote:
> I'm looking for subexpressions of regexp-matches using GNU Awk's
> third parameter of match(). For example
> data = "R=r1,R=r2,R=r3,E=e"
> match (data, /^(R=([^,]+),){2,5}E=(.+)$/, arr)
> [...]
To clarify; what I wanted is access of the values "r1", "r2", "r3",
and "e" through 'arr'.
In article <vt7qs4$2gior$1@dont-email.me>,
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
> [...]
> To clarify; what I wanted is access of the values "r1", "r2", "r3",
> and "e" through 'arr'.
I have to admit that I (still) don't really understand how this match
third arg stuff works. I.e., I can never predict what will happen, so I
always just dump out the array and try to reverse-engineer it each time
I need to use it.
I adapted your code into the following test script:
--- Cut Here ---
#!/bin/sh
gawk 'BEGIN {
    data = "R=r1,R=r2,R=r3,E=e"
    match (data, /^(R=([^,]+),){2,5}E=(.+)$/, arr)
    for (i in arr) print i, arr[i]
}'
# To clarify; what I wanted is access of the values "r1", "r2", "r3",
# and "e" through 'arr'.
--- Cut Here ---
The output I get is:
--- Cut Here ---
0⌙start 1
0⌙length 18
3⌙start 18
1⌙start 11
2⌙start 13
3⌙length 1
2⌙length 2
1⌙length 5
0 R=r1,R=r2,R=r3,E=e
1 R=r3,
2 r3
3 e
--- Cut Here ---
After playing around a bit, I could not come up with any sensible way of getting what you want to get.
As an alternative, it sounds like you could just split the
string on the comma; that would get you:
R=r1
R=r2
R=r3
E=e
Or, for finer control, you could use patsplit().
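The comma split suggested above could look like the following minimal sketch. It relies only on POSIX awk's split(), so it does not need gawk; the shell function name split_fields is made up for illustration.

```shell
#!/bin/sh
# split_fields is a hypothetical helper name, not part of awk.
# It splits its argument on commas and prints one piece per line.
split_fields() {
    awk -v data="$1" 'BEGIN {
        n = split(data, piece, ",")      # piece[1..n] holds the fields
        for (i = 1; i <= n; i++)         # index loop keeps the original order
            print piece[i]
    }'
}

split_fields "R=r1,R=r2,R=r3,E=e"
```

Each piece still carries its "R=" or "E=" prefix, so a second step is needed to pull out the bare values.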
> I have to admit that I (still) don't really understand how this match
> third arg stuff works. I.e., I can never predict what will happen, so I
> always just dump out the array and try to reverse-engineer it each time
> I need to use it.
The extra lines in the output above appear because 'arr' also stores
elements describing the position (start and length) of each matched
subexpression.
I think I'll do the parsing the straightforward two-step way as I did
before the GNU Awk specific functions were available; it's probably
also the clearest way to program that functionality.
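One possible reading of that two-step approach is sketched below, assuming the record format from the original post. It uses only the POSIX two-argument match() with RSTART and RLENGTH, so no gawk extension is involved; the function name parse_record and the output format are invented for this example.

```shell
#!/bin/sh
# parse_record is a hypothetical helper name used only for this sketch.
# Step 1 peels off leading "R=value," groups one at a time;
# step 2 checks the repetition count and the trailing "E=value".
parse_record() {
    awk -v data="$1" 'BEGIN {
        rest = data
        n = 0
        # POSIX two-argument match() sets RSTART and RLENGTH.
        while (match(rest, /^R=[^,]+,/)) {
            # Drop the leading "R=" and the trailing comma.
            r[++n] = substr(rest, 3, RLENGTH - 3)
            rest = substr(rest, RLENGTH + 1)
        }
        # Mirror the {2,5} interval and the final E=(.+) of the pattern.
        if (n < 2 || n > 5 || rest !~ /^E=./)
            exit 1
        for (i = 1; i <= n; i++)
            print "R[" i "]=" r[i]
        print "E=" substr(rest, 3)
    }'
}

parse_record "R=r1,R=r2,R=r3,E=e"
```

The function exits with a non-zero status when the input has fewer than two or more than five R groups, or lacks the E part, which stands in for a failed match.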
In article <vt8bit$2uiq5$1@dont-email.me>,
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
> [...]
> Above output stuff appears because in 'arr' there's additional elements
> about the pattern positions stored.
Just to clarify, I wasn't looking for a tutorial (man page regurgitation).
I understand the man page description of match's 3rd arg as well as anyone;
I just find that it doesn't do as much in practice as (I think) it
should, and that it is unpredictable (by me, anyway) what it will do:
you have to dump out the array and trial-and-error it to get it to do
what you want.
It promises more than it delivers.
[...]
None of which is criticism of the feature; as you say below, it basically does as much as the underlying regexp library allows it to do.
> I think I'll do the parsing the straightforward two-step way as I did
> before the GNU Awk specific functions were available; it's probably
> also the clearest way to program that functionality.
Probably so. BTW, it is not really "GNU Awk specific"; lots of languages have this general capability.
[...]
In article <vt9dre$3t3po$1@dont-email.me>,
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
> The feature can be very useful,
> but not for the case I was looking for. - Actually, it could have
> provided the functionality I was seeking, but since GNU Awk relies
> on the GNU regexp functions as they are implemented I cannot expect
> that any provided features gets extended by Awk. - If GNU Awk would
> have an own RE implementation then we could think about using, e.g.,
> another array dimension to store the (now only temporary existing,
> and generally unavailable) subexpressions.
Actually, this is not so trivial. The data structures at the C level
as mandated by POSIX are one dimensional; the submatches in parentheses
are counted from left to right. There's no way to represent the
subexpressions that are under control of interval expressions, which
would essentially require a two-dimensional data structure.
Mike Haertel is writing a new regexp matcher for gawk; it was announced
here some time ago: https://github.com/mikehaertel/minrx. The code is
in the feature/minrx branch of the gawk Git repository.
I just opened an issue, https://github.com/mikehaertel/minrx/issues/43,
about this question. We shall see what develops.
Arnold
On 11.04.2025 08:33, Aharon Robbins wrote:
> [...]
> Actually, this is not so trivial. The data structures at the C level
> as mandated by POSIX are one dimensional; the submatches in parentheses
> are counted from left to right. There's no way to represent the
> subexpressions that are under control of interval expressions, which
> would essentially require a two-dimensional data structure.
Yes, that's why I had thought about a 2-dimensional array [on GNU
Awk level] so that arr[n][i] for i=1..z would contain the patterns.
This is what I actually tried with GNU Awk (before I had asked you)
to see whether there's some undocumented feature.
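To illustrate the idea, here is a sketch of what such a two-dimensional result could hold for the example data. This is not an existing gawk feature: the array is filled by hand in a loop, and it uses the POSIX grp[n, i] subscript form rather than gawk's arr[n][i] so that it runs in any awk. The names collect_groups, grp and iters are made up.

```shell
#!/bin/sh
# collect_groups is a hypothetical helper; gawk's match() does not do this.
# grp[n, i] holds what parenthesis group n matched on iteration i.
collect_groups() {
    awk -v data="$1" 'BEGIN {
        rest = data
        iters = 0
        while (match(rest, /^R=[^,]+,/)) {
            iters++
            grp[1, iters] = substr(rest, 1, RLENGTH)       # outer group, e.g. "R=r1,"
            grp[2, iters] = substr(rest, 3, RLENGTH - 3)   # inner group, e.g. "r1"
            rest = substr(rest, RLENGTH + 1)
        }
        for (n = 1; n <= 2; n++)
            for (i = 1; i <= iters; i++)
                print "grp[" n "," i "] = " grp[n, i]
    }'
}

collect_groups "R=r1,R=r2,R=r3,E=e"
```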
(flow "R=r1,R=r2,R=r3,E=e"(spl ","))
(flow "R=r1,R=r2,R=r3,E=e"(spl ",")
(awk (:inputs '("R=r1,R=r2,R=r3,E=e"))(:set fs "," ofs ":")
In article <67f8b7af$0$705$14726298@news.sunsite.dk>,
Aharon Robbins <arnold@freefriends.org> wrote:
> [...]
> Mike Haertel is writing a new regexp matcher for gawk; it was announced
> here some time ago: https://github.com/mikehaertel/minrx. The code is
> in the feature/minrx branch of the gawk Git repository.
Just out of curiosity, does the new matcher address the issue raised by Janis?
It sounds like you are implying that it does, but do not say so explicitly.
[...]
On 11/4/25 at 9:10, Janis Papanagnou wrote:
> [...]
> Yes, that's why I had thought about a 2-dimensional array [on GNU
> Awk level] so that arr[n][i] for i=1..z would contain the patterns.
> This is what I actually tried with GNU Awk (before I had asked you)
> to see whether there's some undocumented feature.
A 2-dimensional array is not strictly necessary. It could be possible to
keep the one-dimensional array interface and use the same trick that
POSIX AWK uses for multidimensional array indices, i.e., return a list of
matched values delimited by SUBSEP.
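A rough sketch of how that layout might look from the caller's side follows. It only emulates the proposal with a hand-written loop in POSIX awk; no awk implementation is claimed to return match results this way, and the function name subsep_list is hypothetical.

```shell
#!/bin/sh
# subsep_list is a hypothetical helper showing the proposed layout only.
subsep_list() {
    awk -v data="$1" 'BEGIN {
        rest = data
        joined = ""
        while (match(rest, /^R=[^,]+,/)) {
            val = substr(rest, 3, RLENGTH - 3)
            # Build one string: values of all iterations joined by SUBSEP.
            joined = (joined == "" ? val : joined SUBSEP val)
            rest = substr(rest, RLENGTH + 1)
        }
        arr[2] = joined                        # stays a one-dimensional array
        count = split(arr[2], part, SUBSEP)    # caller recovers the iterations
        for (i = 1; i <= count; i++)
            print i ": " part[i]
    }'
}

subsep_list "R=r1,R=r2,R=r3,E=e"
```

One trade-off of this design is that a matched value containing the SUBSEP character itself could not be told apart from a delimiter.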