Note: this is just a question of aesthetics. Functionally, it all works as expected.
Sample bash code:
f="$(fortune)" # Get some multi-line output into "f"
# Look for foo followed by bar on the same line
[[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar"
The point is you need the "anything other than a newline" or else it might match foo on one line and bar on a later line. The above is the only way I could figure out to express a newline in the particular flavor of reg exps used by the =~ operator.
The problem is that if the above is in a function, when you list out the function with "type funName", the \n has already been digested and
converted to a hard newline. This makes the listing look strange. I'd rather see "\n".
Is there any way to get this?
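To make the cross-line pitfall concrete, here is a minimal sketch. The sample text and the variable names are made up for illustration; only the regex with the newline exclusion comes from the post above.

```bash
#!/usr/bin/env bash
# Hypothetical two-line input: "foo" and "bar" sit on different lines.
text=$'foo here\nbar there'

# Without excluding newlines, .* is free to span the line break.
if [[ $text =~ foo.*bar ]]; then loose=match; else loose=nomatch; fi

# Excluding the newline keeps the whole match on a single line.
if [[ $text =~ foo[^$'\n']*bar ]]; then strict=match; else strict=nomatch; fi

echo "loose=$loose strict=$strict"
```

The first test succeeds on this input and the second does not, which is the behavior the "anything other than a newline" class is there to get.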
Not sure this really addresses your 'type funcName' query but maybe
somewhat better output from 'type funcName' ? :
...
regex=$(printf 'foo[^\n]*bar')  # printf expands \n to a real newline inside the brackets
[[ "$f" =~ $regex ]] && echo "foo bar"
Kind of wish the regex string could be bracketed by "/" as in awk.
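Along the same lines, one possible workaround is to keep the newline in a variable and expand it inside the bracket expression, so the function body contains no ANSI-C quoting for bash to digest. This is only a sketch: the names nl and has_foo_bar are invented here, and the function then depends on nl being set in the calling environment.

```bash
#!/usr/bin/env bash
# Keep the newline in a variable (the name "nl" is an arbitrary choice).
nl=$'\n'

# Hypothetical function name; the regex is the one discussed in the thread.
has_foo_bar() {
    [[ "$1" =~ foo[^$nl]*bar ]] && echo "foo bar"
}

has_foo_bar 'foo and bar'           # foo and bar on the same line
has_foo_bar $'foo\nbar' || true     # foo and bar on different lines

# Capture and show the listing; it should contain $nl, not a line break.
listing=$(declare -f has_foo_bar)
printf '%s\n' "$listing"
```

Because the expansion of $nl happens when the function runs rather than when it is parsed, the listing keeps the variable reference as typed.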
On 2024-07-22, Kenny McCormack <gazelle@shell.xmission.com> wrote:
The problem is that if the above is in a function, when you list out the
function with "type funName", the \n has already been digested and
converted to a hard newline. This makes the listing look strange. I'd
rather see "\n".
I see what you mean:
$ test() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }
$ set | grep -A 4 '^test'
test ()
{
    [[ "$f" =~ foo[^'
']*bar ]] && echo "foo bar"
}
Is there any way to get this?
[...]
Both (ksh & zsh) seem to show "better aesthetics".
Too bad it doesn't help for your bash context.
In article <v7nu8t$15bon$1@dont-email.me>,
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
...
Both (ksh & zsh) seem to show "better aesthetics".
Indeed, it does. That is how it should work.
On 23.07.2024 13:46, Kenny McCormack wrote:
In article <v7nu8t$15bon$1@dont-email.me>,
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
...
Both (ksh & zsh) seem to show "better aesthetics".
Indeed, it does. That is how it should work.
BTW, it's interesting that bash and zsh both reformat (sort
of pretty-print) the code (when using 'typeset -f'), only
that zsh keeps that literal '\n'. This may show a way (by
zsh example) how to follow Kaz' suggestion of patching the
bash. (But, frankly, I'm not sure it was meant seriously. (see ** below))
But ksh displays it as it had been typed in; a raw format.
If you define your function, say, as multi-line code you
also see it that way, there's no processing at that point
(or the original retained as copy). I didn't expect that.
In article <v7ofkl$18d66$1@dont-email.me>,
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
On 23.07.2024 13:46, Kenny McCormack wrote:
One thing that bash does that's annoying is that it puts semicolons on the end of (almost) every line.
I have, on occasion, had to recover a function from
the bash pretty print (*), and one of the things that needs to be done is
to remove those extraneous semicolons.
(*) BTW, the command I use is "type". I.e., "type funName" displays the function definition of function funName. That seems to be the same as your use of "typeset".
But ksh displays it as it had been typed in; a raw format.
If you define your function, say, as multi-line code you
also see it that way, there's no processing at that point
(or the original retained as copy). I didn't expect that.
Yep. Note also that bash reformats something like:
cmd1 &&
cmd2 &&
cmd3
to:
cmd1 && cmd2 && cmd3
which is annoying.
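A small sketch to reproduce that reformatting, assuming a hypothetical function named chain in which "true" stands in for cmd1, cmd2 and cmd3. The exact layout of the listing may differ between bash versions.

```bash
#!/usr/bin/env bash
# Hypothetical function, deliberately written across several lines.
chain() {
    true &&
        true &&
        true
}

# Capture and print bash's own rendering of the function body.
listing=$(declare -f chain)
printf '%s\n' "$listing"
```

Comparing the printed listing with the definition as typed shows how much of the original line structure survives.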
(**) I've hacked the bash source code for less. So, yeah, it is possible.
Indeed. It reminds me of the philosophy that I often noticed in MS (and
nowadays also in Linux software, sadly) contexts; they seem to think
their auto-changes are better than the intention of the programmer.
Which all kind of echoes back to the other recent thread in this NG about regular expressions vs. globs. The cold hard fact is that there really is
no such thing as "regular expressions" (*), since every language, every program, every implementation of them, is quite different.
(*) As an abstract concept, separate from any specific implementation.
On 23.07.2024 00:47, Kaz Kylheku wrote:
On 2024-07-22, Kenny McCormack <gazelle@shell.xmission.com> wrote:
The problem is that if the above is in a function, when you list out the
function with "type funName", the \n has already been digested and
converted to a hard newline. This makes the listing look strange. I'd
rather see "\n".
I see what you mean:
$ test() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }
$ set | grep -A 4 '^test'
test ()
{
    [[ "$f" =~ foo[^'
']*bar ]] && echo "foo bar"
}
Is there any way to get this?
Of course (and out of curiosity) I tried that display detail as well
in Kornshell to see how it behaves, and using a different command to
display it...
With my (old?) bash:
$ f() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }
$ typeset -f f
f ()
{
    [[ "$f" =~ foo[^'
']*bar ]] && echo "foo bar"
}
The same with ksh:
$ f() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }
$ typeset -f f
f() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }
And for good measure also in zsh:
% f() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }
% typeset -f f
f () {
	[[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar"
}
On 2024-07-23, Kenny McCormack <gazelle@shell.xmission.com> wrote:
Which all kind of echoes back to the other recent thread in this NG about
regular expressions vs. globs. The cold hard fact is that there really is
no such thing as "regular expressions" (*), since every language, every
program, every implementation of them, is quite different.
(*) As an abstract concept, separate from any specific implementation.
Yes, there are regular expressions as an abstract concept. They are part
of the theory of automata. Much of the research went on up through the 1960's. The * operator is called the "Kleene star". https://en.wikipedia.org/wiki/Kleene_star
In the old math/CS papers about regular expressions, regular expressions
are typically represented in terms of some input symbol alphabet
(usually just letters a, b, c ...) and only the operators | and *,
and parentheses (other than when advanced operators are being discussed,
like intersection and complement, which are not easily constructed from these).
I think character classes might have been a pragmatic invention in
regex implementations. The theory doesn't require [a-c] because
that can be encoded as (a|b|c).
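As a quick check of that encoding, the following sketch compares the two spellings on a handful of test strings using bash's =~ operator. The anchors and the sample strings are additions made here for illustration, so that each test asks whether the whole string is a member of the language.

```bash
#!/usr/bin/env bash
# Two spellings of the same language; the anchors force the whole
# string to be a member, not just some substring of it.
class='^[a-c]$'
alt='^(a|b|c)$'

mismatches=0
for s in a b c d ab ""; do
    if [[ $s =~ $class ]]; then in_class=yes; else in_class=no; fi
    if [[ $s =~ $alt ]]; then in_alt=yes; else in_alt=no; fi
    printf '%-4s class=%s alt=%s\n' "\"$s\"" "$in_class" "$in_alt"
    # Count every string on which the two expressions disagree.
    [ "$in_class" = "$in_alt" ] || mismatches=$(( mismatches + 1 ))
done
echo "mismatches: $mismatches"
```

Only lowercase letters are used as samples, since the meaning of a range like a-c can depend on the locale.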
The ? operator is not required because (R)? can be written (R|ε), with ε denoting the empty string; likewise (R)+ can be written (R)(R)*.
Escaping is not required because the operators and input symbols are
distinct; the idea that ( could be an input symbol is something that
occurs in implementations, not in the theory.
Regex implementors take the theory and adjust it to taste,
and add necessary details such as character escape sequences for
control characters, and escaping to allow the operator characters
themselves to be matched. Plus character classes, with negation
and ranges and all that.
Not all implementations follow solid theory. For instance, the branch
operator | is supposed to be commutative. There is no difference
between R1|R2 and R2|R1. But in many implementations (particularly
backtracking ones like PCRE and similar), there is a difference: these
implementations implement R1|R2|R3 by trying the expressions in left to
right order and stopping at the first match.
This matters when regexes are used for matching a prefix of the input;
if the regex is interpreted according to the theory, it should match
the longest possible prefix; it cannot ignore R3, which matches
thousands of symbols, because R2 matched three symbols.
Kaz Kylheku <643-408-1753@kylheku.com> writes:
This matters when regexes are used for matching a prefix of the input;
if the regex is interpreted according to the theory, it should match
the longest possible prefix; it cannot ignore R3, which matches
thousands of symbols, because R2 matched three symbols.
This is more a consequence of the different views. In the formal
theory there is no notion of "matching". Regular expressions define languages (i.e. sets of sequences of symbols) according to a recursive
set of rules. The whole idea of an RE matching a string is from their
use in practical applications.
In article <v7omtd$19ng6$1@dont-email.me>,
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
...
Indeed. It reminds me of the philosophy that I often noticed in MS (and
nowadays also in Linux software, sadly) contexts; they seem to think
their auto-changes are better than the intention of the programmer.
The overall plan is to turn programming into a minimum wage job. That's
why they are starting to call it "coding" and make it sound like something anybody can do.
So, they have to take as much as possible of the choice/initiative out of it. Make it the modern equivalent of a factory job.
On 2024-07-23, Ben Bacarisse <ben@bsb.me.uk> wrote:
Kaz Kylheku <643-408-1753@kylheku.com> writes:
This matters when regexes are used for matching a prefix of the input;
if the regex is interpreted according to the theory, it should match
the longest possible prefix; it cannot ignore R3, which matches
thousands of symbols, because R2 matched three symbols.
This is more a consequence of the different views. In the formal
theory there is no notion of "matching". Regular expressions define
languages (i.e. sets of sequences of symbols) according to a recursive
set of rules. The whole idea of an RE matching a string is from their
use in practical applications.
Under the set view, we can ask, what is the longest prefix of
the input which belongs to the language R1|R2. The answer is the
same for R2|R1, which denote the same set, since | corresponds
to set union.
Broken regular expressions identify the longest prefix, except
when the | operator is used; then they just identify a prefix,
not necessarily longest.
Kaz Kylheku <643-408-1753@kylheku.com> writes:
On 2024-07-23, Ben Bacarisse <ben@bsb.me.uk> wrote:
Kaz Kylheku <643-408-1753@kylheku.com> writes:
This matters when regexes are used for matching a prefix of the input;
if the regex is interpreted according to the theory, it should match
the longest possible prefix; it cannot ignore R3, which matches
thousands of symbols, because R2 matched three symbols.
This is more a consequence of the different views. In the formal
theory there is no notion of "matching". Regular expressions define
languages (i.e. sets of sequences of symbols) according to a recursive
set of rules. The whole idea of an RE matching a string is from their
use in practical applications.
Under the set view, we can ask, what is the longest prefix of
the input which belongs to the language R1|R2. The answer is the
same for R2|R1, which denote the same set, since | corresponds
to set union.
What is "the input" in the set view? The set view is simply a recursive definition of the language.
Broken regular expressions identify the longest prefix, except
when the | operator is used; then they just identify a prefix,
not necessarily longest.
What is a "broken" RE in the set view?
On 2024-07-24, Ben Bacarisse <ben@bsb.me.uk> wrote:
Kaz Kylheku <643-408-1753@kylheku.com> writes:
On 2024-07-23, Ben Bacarisse <ben@bsb.me.uk> wrote:
Kaz Kylheku <643-408-1753@kylheku.com> writes:
This matters when regexes are used for matching a prefix of the input;
if the regex is interpreted according to the theory, it should match
the longest possible prefix; it cannot ignore R3, which matches
thousands of symbols, because R2 matched three symbols.
This is more a consequence of the different views. In the formal
theory there is no notion of "matching". Regular expressions define
languages (i.e. sets of sequences of symbols) according to a recursive
set of rules. The whole idea of an RE matching a string is from their
use in practical applications.
Under the set view, we can ask, what is the longest prefix of
the input which belongs to the language R1|R2. The answer is the
same for R2|R1, which denote the same set, since | corresponds
to set union.
What is "the input" in the set view? The set view is simply a recursive
definition of the language.
It is a separate string under consideration.
We have a set, and are asking the question "what is the longest prefix
of the given string which is a member of the set".
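That question can be sketched directly in bash by testing each prefix for membership, longest first. The function name longest_prefix is hypothetical, and the sketch assumes the expression is a valid ERE that can be wrapped as ^(...)$ to test whole-string membership.

```bash
#!/usr/bin/env bash
# longest_prefix REGEX STRING
# Prints the longest prefix of STRING that, taken as a whole, belongs
# to the language of REGEX. Returns 1 if no prefix is a member.
longest_prefix() {
    local re="^($1)\$" s=$2 i
    for (( i = ${#s}; i >= 0; i-- )); do
        if [[ ${s:0:i} =~ $re ]]; then
            printf '%s\n' "${s:0:i}"
            return 0
        fi
    done
    return 1
}

longest_prefix 'a|ab*' abbbbbbbc
longest_prefix 'ab*|a' abbbbbbbc
```

Since only set membership is tested, the order of the branches cannot influence the answer: both calls print the same prefix.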
Broken regular expressions identify the longest prefix, except
when the | operator is used; then they just identify a prefix,
not necessarily longest.
What is a "broken" RE in the set view?
Inconsistency in being able to answer the question "what is the longest prefix of the string which is a member of the set".
Broken regexes contain a pitfall: they deliver the right answer
for expressions like ab*. If the input is "abbbbbbbc",
they identify the entire "abbbbbbb" prefix. But if the branch
operator is used, as in "a|ab*", oops, they short-circuit.
The "a" matches a prefix of the input, and so that's done; no need
to match the "ab*" part of the branch.
The "a" prefix is in the language described by the expression; a
set element has been identified. But it's not the longest one.
It is an inconsistency. If the longest match is not required, why
bother finding one for "ab*"; for that expression, the "a" prefix could
also just be returned.
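For comparison, bash's own =~ operator hands the pattern to the C library's POSIX extended regex matcher, which is specified to return the leftmost-longest match. The sketch below assumes such a POSIX-conforming library (glibc, for instance); the helper name match_of is made up. A backtracking engine such as PCRE would typically report just "a" for the a|ab* case.

```bash
#!/usr/bin/env bash
input=abbbbbbbc

# match_of REGEX STRING: print the text matched by the whole regex.
match_of() {
    [[ $2 =~ $1 ]] && printf '%s\n' "${BASH_REMATCH[0]}"
}

plain=$(match_of 'ab*' "$input")
first=$(match_of 'a|ab*' "$input")
second=$(match_of 'ab*|a' "$input")

printf 'ab*   -> %s\n' "$plain"
printf 'a|ab* -> %s\n' "$first"
printf 'ab*|a -> %s\n' "$second"
```

If all three lines report the same text, the matcher is treating | as commutative in the sense described above.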