• Re: Correct syntax for pathological re.search()

    From MRAB@21:1/5 to Michael F. Stemper via Python-list on Tue Oct 8 20:11:40 2024
    On 2024-10-07 14:35, Michael F. Stemper via Python-list wrote:
    I'm trying to discard lines that include the string "\sout{" (which is TeX, for
    those who are curious. I have tried:
    if not re.search("\sout{", line):
    if not re.search("\sout\{", line):
    if not re.search("\\sout{", line):
    if not re.search("\\sout\{", line):

    But the lines with that string keep coming through. What is the right syntax to
    properly escape the backslash and the left curly bracket?

    String literals use backslash as an escape character, so it needs to be escaped, or you need to use a "raw" string.

    However, regex also uses backslash as an escape character.

    That means that a literal backslash in a regex that's in a plain string
    literal needs to be doubly-escaped, once for the string literal and
    again for the regex.
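A minimal sketch of the double escaping described above, using the `\sout{` text from the original question. The `lines` list is made-up sample input, and escaping the brace is optional here; it is escaped only to show both layers at once.

```python
import re

# Hypothetical sample input; only the second line contains the TeX macro.
lines = ["keep this line", r"drop \sout{this} line", "keep this too"]

# Raw string: the regex engine receives \\sout\{ (escaped backslash, escaped brace).
raw_pattern = r"\\sout\{"
# Plain string: every backslash is doubled once more for the string literal.
plain_pattern = "\\\\sout\\{"

# Both literals denote exactly the same regex text.
assert raw_pattern == plain_pattern

# Discard lines that contain the literal text \sout{
kept = [line for line in lines if not re.search(raw_pattern, line)]
print(kept)
```

The raw-string form is usually easier to read because only the regex layer of escaping remains visible.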

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MRAB@21:1/5 to Karsten Hilbert via Python-list on Tue Oct 8 20:07:04 2024
    On 2024-10-08 19:30, Karsten Hilbert via Python-list wrote:
    On Mon, Oct 07, 2024 at 08:35:32AM -0500, Michael F. Stemper via Python-list wrote:

    I'm trying to discard lines that include the string "\sout{" (which is TeX, for
    those who are curious. I have tried:
    if not re.search("\sout{", line):
    if not re.search("\sout\{", line):
    if not re.search("\\sout{", line):
    if not re.search("\\sout\{", line):

    unwanted_tex = '\sout{'
    if unwanted_tex not in line: do_something_with_libreoffice()

    That should be:

    unwanted_tex = r'\sout{'

    or:

    unwanted_tex = '\\sout{'
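As a sketch of the plain substring test suggested here, no regex is needed at all, so only the string-literal layer of escaping applies. The helper name `keep_line` and the sample data are assumptions for illustration.

```python
# Raw string: backslash, s, o, u, t, brace (six characters).
unwanted_tex = r"\sout{"

def keep_line(line: str) -> bool:
    """Return True when the line does not contain the unwanted TeX macro."""
    return unwanted_tex not in line

# Hypothetical sample input.
sample = ["plain text", r"struck \sout{out} text"]
kept = [line for line in sample if keep_line(line)]
print(kept)
```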

  • From Karsten Hilbert@21:1/5 to All on Tue Oct 8 20:30:34 2024
    On Mon, Oct 07, 2024 at 08:35:32AM -0500, Michael F. Stemper via Python-list wrote:

    I'm trying to discard lines that include the string "\sout{" (which is TeX, for
    those who are curious. I have tried:
    if not re.search("\sout{", line):
    if not re.search("\sout\{", line):
    if not re.search("\\sout{", line):
    if not re.search("\\sout\{", line):

    unwanted_tex = '\sout{'
    if unwanted_tex not in line: do_something_with_libreoffice()

    Karsten
    --
    GPG 40BE 5B0E C98E 1713 AFA6 5BC0 3BEA AC80 7D4F C89B

  • From Stefan Ram@21:1/5 to Stefan Ram on Tue Oct 8 19:57:45 2024
    ram@zedat.fu-berlin.de (Stefan Ram) wrote or quoted:
    "\\\\chardef \\\\\\\\ = '\\\\\\\\".

    However, one can rewrite this as follows:

    "`chardef `` = '``".replace( "`", "\\"*2 )

    . One can also use "repr" to find how to represent something:

    main.py

    text = input( "What do you want me to represent as a literal? " )
    print( repr( text ))

    transcript

    What do you want me to represent as a literal? \\sout\{
    '\\\\sout\\{'

    . We can use "escape" and "repr" to find how to represent
    a regular expression for a literal text:

    main.py

    import re

    text = input( "Want the literal of an re for what text? " )
    print( repr( re.escape( text )))

    transcript

    Want the literal of an re for what text? \sout{
    '\\\\sout\\{'

    .
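A short sketch of the same idea without the interactive prompt: `re.escape` builds the regex for a literal text, so only the needle itself has to be written correctly. The sample haystack is borrowed from elsewhere in this thread.

```python
import re

needle = r"\sout{"             # the literal text to find
pattern = re.escape(needle)    # a regex that matches exactly that text

match = re.search(pattern, r"testing \sout{WHADDEVVA}")
print(pattern)                 # the regex as the engine sees it
print(repr(pattern))           # the same regex written as a string literal
print(match.span())
```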

  • From Stefan Ram@21:1/5 to MRAB on Tue Oct 8 19:32:04 2024
    MRAB <python@mrabarnett.plus.com> wrote or quoted:
    However, regex also uses backslash as an escape character.

    TeX also uses the backslash as an escape character:

    \chardef \\ = '\\

    , the regular expression to search exactly this:

    \\chardef \\\\ = '\\\\

    , and the Python string literal for that regular expression:

    "\\\\chardef \\\\\\\\ = '\\\\\\\\".

    . Must be a reason Markdown started to use the backtick!
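The three layers above can be checked mechanically. This is only a sketch; the variable names are made up for illustration.

```python
import re

# Layer 1: the text as it appears in the TeX file.
tex_text = r"\chardef \\ = '\\"
# Layer 2: a regex that matches exactly that text (written as a raw string).
regex = r"\\chardef \\\\ = '\\\\"
# Layer 3: the same regex written as a plain (non-raw) string literal.
literal = "\\\\chardef \\\\\\\\ = '\\\\\\\\"

# The raw and the plain literal denote the same regex ...
assert literal == regex
# ... and that regex matches the TeX text in full.
assert re.fullmatch(regex, tex_text)
```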

  • From Alan Bawden@21:1/5 to All on Tue Oct 8 16:59:48 2024
    Karsten Hilbert <Karsten.Hilbert@gmx.net> writes:

    Python 3.11.2 (main, Aug 26 2024, 07:20:54) [GCC 12.2.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> tex = '\sout{'
    >>> tex
    '\\sout{'
    >>>

    Am I missing something ?

    You're missing the warning it generates:

    > python -E -Wonce
    Python 3.11.2 (main, Aug 26 2024, 07:20:54) [GCC 12.2.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> tex = '\sout{'
    <stdin>:1: DeprecationWarning: invalid escape sequence '\s'
    >>>

  • From Karsten Hilbert@21:1/5 to All on Tue Oct 8 22:17:49 2024
    On Tue, Oct 08, 2024 at 08:07:04PM +0100, MRAB via Python-list wrote:

    unwanted_tex = '\sout{'
    if unwanted_tex not in line: do_something_with_libreoffice()

    That should be:

    unwanted_tex = r'\sout{'

    Hm.

    Python 3.11.2 (main, Aug 26 2024, 07:20:54) [GCC 12.2.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> tex = '\sout{'
    >>> tex
    '\\sout{'
    >>>

    Am I missing something ?

    Karsten
    --
    GPG 40BE 5B0E C98E 1713 AFA6 5BC0 3BEA AC80 7D4F C89B

  • From MRAB@21:1/5 to Alan Bawden via Python-list on Tue Oct 8 23:10:03 2024
    On 2024-10-08 21:59, Alan Bawden via Python-list wrote:
    Karsten Hilbert <Karsten.Hilbert@gmx.net> writes:

    Python 3.11.2 (main, Aug 26 2024, 07:20:54) [GCC 12.2.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> tex = '\sout{'
    >>> tex
    '\\sout{'
    >>>

    Am I missing something ?

    You're missing the warning it generates:

    > python -E -Wonce
    Python 3.11.2 (main, Aug 26 2024, 07:20:54) [GCC 12.2.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> tex = '\sout{'
    <stdin>:1: DeprecationWarning: invalid escape sequence '\s'
    >>>

    You got lucky that \s is invalid. If it had been \t you would've got a
    tab character.

    Historically, Python treated invalid escape sequences as literals, but
    it's deprecated now and will become an outright error in the future
    (probably) because it often hides a mistake, such as the aforementioned
    \t being treated as a tab character when the user expected it to be a
    literal backslash followed by letter t. (This can occur within Windows
    file paths written in plain string literals.)
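A sketch of how the warning can be made visible from a script instead of the interactive prompt. It compiles a small piece of source text with warnings recorded. The warning category differs between Python versions (DeprecationWarning in 3.11, SyntaxWarning in 3.12), so the sketch does not depend on it, and it assumes a version in which the invalid escape is still only a warning.

```python
import warnings

# Source text containing the invalid escape sequence \s.
source = "tex = '\\sout{'"

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")          # record every warning
    code = compile(source, "<demo>", "exec")

namespace = {}
exec(code, namespace)

for w in caught:
    print(w.category.__name__, w.message)

# The invalid escape was kept as backslash + s ...
print(repr(namespace["tex"]))
# ... whereas a valid escape such as \t silently becomes a tab.
print(len("\tex"), len(r"\tex"))
```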

  • From Karsten Hilbert@21:1/5 to All on Wed Oct 9 20:06:10 2024
    On Tue, Oct 08, 2024 at 04:59:48PM -0400, Alan Bawden via Python-list wrote:

    Karsten Hilbert <Karsten.Hilbert@gmx.net> writes:

    Python 3.11.2 (main, Aug 26 2024, 07:20:54) [GCC 12.2.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> tex = '\sout{'
    >>> tex
    '\\sout{'
    >>>

    Am I missing something ?

    You're missing the warning it generates:

    <stdin>:1: DeprecationWarning: invalid escape sequence '\s'

    I knew it'd be good to ask :-D

    Karsten
    --
    GPG 40BE 5B0E C98E 1713 AFA6 5BC0 3BEA AC80 7D4F C89B

  • From Gilmeh Serda@21:1/5 to Michael F. Stemper on Fri Oct 11 14:43:56 2024
    On Mon, 7 Oct 2024 08:35:32 -0500, Michael F. Stemper wrote:

    I'm trying to discard lines that include the string "\sout{" (which is
    TeX, for those who are curious. I have tried:
    if not re.search("\sout{", line):
    if not re.search("\sout\{", line):
    if not re.search("\\sout{", line):
    if not re.search("\\sout\{", line):

    But the lines with that string keep coming through. What is the right
    syntax to properly escape the backslash and the left curly bracket?

    $ python
    Python 3.12.6 (main, Sep 8 2024, 13:18:56) [GCC 14.2.1 20240805] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import re
    >>> s = r"testing \sout{WHADDEVVA}"
    >>> re.search(r"\\sout{", s)
    <re.Match object; span=(8, 14), match='\\sout{'>

    You want a literal backslash, hence, you need to escape everything.

    It is not enough to escape the "\s" as "\\s", because that only takes care
    of Python's demands for escaping "\". You also need to escape the "\" for
    the RegEx as well, or it will read it as "\s", which is the RegEx for any
    whitespace character, and therefore your search doesn't match, because it
    reads it as a search for whitespace followed by "out{".

    Therefore, you need to escape it either as per my example, or by using
    four "\" and no "r" in front of the first quote, which also works:

    >>> re.search("\\\\sout{", s)
    <re.Match object; span=(8, 14), match='\\sout{'>

    You don't need to escape the curly braces. We call them "seagull wings"
    where I live.
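A small sketch of why the singly escaped pattern fails, using the same sample string. The tab-containing test string is an assumption added to show that `\s` matches more than a plain space.

```python
import re

text = r"testing \sout{WHADDEVVA}"

# "\\sout{" reaches the regex engine as \sout{ : one whitespace character, then "out{".
assert re.search("\\sout{", text) is None
assert re.search("\\sout{", "x\tout{") is not None   # a tab counts as whitespace

# r"\\sout{" reaches the engine as \\sout{ : a literal backslash, then "sout{".
hit = re.search(r"\\sout{", text)
print(hit.span())
```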

    --
    Gilmeh

    Sometimes I simply feel that the whole world is a cigarette and I'm the
    only ashtray.

  • From MRAB@21:1/5 to AVI GROSS via Python-list on Sat Oct 12 01:37:55 2024
    On 2024-10-11 22:13, AVI GROSS via Python-list wrote:
    Is there some utility function out there that can be called to show what the regular expression you typed in will look like by the time it is ready to be used?

    Obviously, life is not that simple as it can go through multiple layers with each dealing with a layer of backslashes.

    But for simple cases, ...

    Yes. It's called 'print'. :-)
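For the simple case this can be sketched as follows: `print` shows the text the regex engine will receive, while `repr` shows it as a string literal again.

```python
pattern = "\\\\sout{"      # plain literal with four backslashes

as_seen_by_re = pattern    # the actual string: two backslashes, then sout{
as_literal = repr(pattern) # how Python would write that string as a literal

print(as_seen_by_re)
print(as_literal)
```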



  • From Peter J. Holzer@21:1/5 to Thomas Passin via Python-list on Fri Oct 18 23:09:41 2024
    On 2024-10-12 08:51:57 -0400, Thomas Passin via Python-list wrote:
    On 10/12/2024 6:59 AM, Peter J. Holzer via Python-list wrote:
    On 2024-10-11 17:13:07 -0400, AVI GROSS via Python-list wrote:
    Is there some utility function out there that can be called to show what the
    regular expression you typed in will look like by the time it is ready to be
    used?

    I assume that by "ready to be used" you mean the compiled form?

    No, there doesn't seem to be a way to dump that. You can

    p = re.compile("\\\\sout{")
    print(p.pattern)

    but that just prints the input string, which you could do without
    compiling it first.

    It prints the escaped version,

    Did you mean the *un*escaped version? Well, yeah, that's what print
    does.

    so you can see if you escaped the string as you intended. In this
    case, the print will display '\\sout{'.

    print("\\\\sout{")
    will do the same.

    It seems to me that for any string s which is a valid regular expression
    (i.e. re.compile doesn't throw an exception)

    assert re.compile(s).pattern == s

    holds.

    So it doesn't give you anything you didn't already know.

    As a trivial example, the regular expressions r"\\sout{" and r"\\sout\{"
    are equivalent (the \ before the { is redundant). Yet
    re.compile(s).pattern preserves the difference between the two strings.
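A quick sketch of that observation: `.pattern` returns the input string unchanged, even for two patterns that match the same texts. The list of sample strings is an assumption for illustration, not an exhaustive proof of equivalence.

```python
import re

a, b = r"\\sout{", r"\\sout\{"
pa, pb = re.compile(a), re.compile(b)

# .pattern is simply the string that was passed in.
assert pa.pattern == a and pb.pattern == b
assert pa.pattern != pb.pattern

# On these sample strings both patterns agree on match / no match.
samples = [r"\sout{x}", "sout{", r"a \sout{", " out{"]
same = all(bool(pa.search(s)) == bool(pb.search(s)) for s in samples)
print(same)
```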

    hp

    --
    _ | Peter J. Holzer | Story must make more sense than reality.
    |_|_) | |
    | | | hjp@hjp.at | -- Charles Stross, "Creative writing
    __/ | http://www.hjp.at/ | challenge!"

  • From jak@21:1/5 to All on Sat Oct 19 00:15:23 2024
    Peter J. Holzer wrote:
    As a trivial example, the regular expressions r"\\sout{" and r"\\sout\{"
    are equivalent (the \ before the { is redundant). Yet
    re.compile(s).pattern preserves the difference between the two strings.


    Hi,
    Allow me to be fussy: r"\\sout{" and r"\\sout\{" are similar but not equivalent. If you omit the backslash, the parser has to determine
    whether the brace starts a {n,m} quantifier, which takes more
    time. An online regex tester gave these results:

    r"\\sout{" : 1 match ( 7 steps, 620 μs )

    r"\\sout\{" : 1 match ( 7 steps, 360 μs )
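A rough way to measure this with Python's own `re` module, as a sketch only. The figures above come from an online tester that may use a different regex engine, so they need not carry over. The helper `compile_fresh` is hypothetical and calls `re.purge()` so that the pattern cache does not hide the parsing cost.

```python
import re
import timeit

patterns = [r"\\sout{", r"\\sout\{"]
text = r"testing \sout{WHADDEVVA}"

def compile_fresh(pattern):
    """Compile without the benefit of re's internal pattern cache."""
    re.purge()
    return re.compile(pattern)

for p in patterns:
    # Parsing cost: this is where the brace handling could differ.
    compile_time = timeit.timeit(lambda: compile_fresh(p), number=1000)
    # Matching cost with the compiled pattern served from the cache.
    search_time = timeit.timeit(lambda: re.search(p, text), number=1000)
    print(f"{p!r}: compile {compile_time:.4f}s, search {search_time:.4f}s")
```

Timings vary between runs and machines, so no expected output is given.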
