• Re: Correct syntax for pathological re.search()

    From Peter J. Holzer@21:1/5 to AVI GROSS via Python-list on Sat Oct 12 12:59:58 2024
    On 2024-10-11 17:13:07 -0400, AVI GROSS via Python-list wrote:
    Is there some utility function out there that can be called to show what the regular expression you typed in will look like by the time it is ready to be used?

    I assume that by "ready to be used" you mean the compiled form?

    No, there doesn't seem to be a way to dump that. You can

    p = re.compile("\\\\sout{")
    print(p.pattern)

    but that just prints the input string, which you could do without
    compiling it first.

    But - without having looked at the implementation - it's far from clear
    that the compiled form would be useful to the user. It's probably some
    kind of state machine, and a large table of state transitions isn't very readable.
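
    For what it's worth, the re module does document a re.DEBUG flag that prints debug information about the compiled expression. Below is a minimal sketch of capturing that dump; the helper name dump_regex is made up for this example, and the exact layout of the dump is an implementation detail that may differ between Python versions.

```python
import contextlib
import io
import re


def dump_regex(pattern):
    """Return the text that re.DEBUG prints while compiling `pattern`."""
    buffer = io.StringIO()
    # re.DEBUG writes its dump to sys.stdout, so capture it temporarily.
    with contextlib.redirect_stdout(buffer):
        re.compile(pattern, re.DEBUG)
    return buffer.getvalue()


dump = dump_regex("\\\\sout{")
print(dump)
```

    In the dump, 92 and 123 are the code points of the backslash and the left brace, so it shows the pattern asking for a literal \ followed by "sout" and a literal {.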

    There are a number of websites which visualize regular expressions.
    Those are probably better for debugging a regular expression than
    anything the re module could reasonably produce (although with the
    caveat that such a web site would use a different implementation and
    therefore might produce different results).

    hp

    --
    _ | Peter J. Holzer | Story must make more sense than reality.
    |_|_) | |
    | | | hjp@hjp.at | -- Charles Stross, "Creative writing
    __/ | http://www.hjp.at/ | challenge!"

  • From Stefan Ram@21:1/5 to Peter J. Holzer on Sat Oct 12 11:59:52 2024
    "Peter J. Holzer" <hjp-python@hjp.at> wrote or quoted:
    But - without having looked at the implementation - it's far from clear
    that the compiled form would be useful to the user.

    So, what he might be getting at with "compiled form" is a
    representation that's easy on the eyes for us mere mortals.

    You could, for instance, use colors to show the difference between
    object and meta characters. In that case, the regex "\**" would
    come out as "**", but the first "*" might be navy blue (on a white
    background), so just your run-of-the-mill object character, while
    the second one would be burgundy, flagging it as a meta character.

    So, simplified, that would be something like:

    import re
    import tkinter as tk
    import time

    def tokenize_regex( pattern ):
        tokens = []
        i = 0
        while i < len( pattern ):
            if pattern[ i ] == '\\':
                if i + 1 < len( pattern ):
                    tokens.append( ( 'escaped', pattern[ i+1: i+2 ]))
                    i += 2
                else:
                    tokens.append( ('error', 'Incomplete escape sequence' ))
                    i += 1
            elif pattern[i] == '*':
                tokens.append( ( 'repetition', '*' ))
                i += 1
            else:
                tokens.append( ( 'plain', pattern[ i ]))
                i += 1

        return tokens

    root = tk.Tk()
    root.configure( bg='white' )

    regex = r'\**'
    result = tokenize_regex( regex )

    for token_type, token_value in result:
        if token_type == 'plain' or token_type == 'escaped':
            tk.Label( root, text=token_value, font=( 'Arial', 40 ), fg='#4070FF', bg='white' ).pack( side='left' )
        elif token_type == 'repetition':
            tk.Label( root, text=token_value, font=( 'Arial', 40 ), fg='#C02000', bg='white' ).pack( side='left' )

    root.mainloop()

    .

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Passin@21:1/5 to Peter J. Holzer via Python-list on Sat Oct 12 08:51:57 2024
    On 10/12/2024 6:59 AM, Peter J. Holzer via Python-list wrote:
    On 2024-10-11 17:13:07 -0400, AVI GROSS via Python-list wrote:
    Is there some utility function out there that can be called to show what the regular expression you typed in will look like by the time it is ready to be used?

    I assume that by "ready to be used" you mean the compiled form?

    No, there doesn't seem to be a way to dump that. You can

    p = re.compile("\\\\sout{")
    print(p.pattern)

    but that just prints the input string, which you could do without
    compiling it first.

    It prints the escaped version, so you can see if you escaped the string
    as you intended. In this case, the print will display '\\sout{'. That's
    worth something.


    But - without having looked at the implementation - it's far from clear
    that the compiled form would be useful to the user. It's probably some
    kind of state machine, and a large table of state transitions isn't very readable.

    There are a number of websites which visualize regular expressions.
    Those are probably better for debugging a regular expression than
    anything the re module could reasonably produce (although with the
    caveat that such a web site would use a different implementation and therefore might produce different results).

    hp



  • From Thomas Passin@21:1/5 to MRAB via Python-list on Sat Oct 12 09:06:54 2024
    On 10/11/2024 8:37 PM, MRAB via Python-list wrote:
    On 2024-10-11 22:13, AVI GROSS via Python-list wrote:
    Is there some utility function out there that can be called to show
    what the
    regular expression you typed in will look like by the time it is ready
    to be
    used?

    Obviously, life is not that simple as it can go through multiple
    layers with
    each dealing with a layer of backslashes.

    But for simple cases, ...

    Yes. It's called 'print'. :-)

    There is a section in the Python docs about this backslash subject. It's
    titled "The Backslash Plague" in

    https://docs.python.org/3/howto/regex.html

    You can also inspect the compiled expression to see what string it
    received after all the escaping:

    import re

    re_string = '\\w+\\\\sub'
    re_pattern = re.compile(re_string)

    # Should look as if we had used r'\w+\\sub'
    print(re_pattern.pattern)
    \w+\\sub


    -----Original Message-----
    From: Python-list <python-list-
    bounces+avi.e.gross=gmail.com@python.org> On
    Behalf Of Gilmeh Serda via Python-list
    Sent: Friday, October 11, 2024 10:44 AM
    To: python-list@python.org
    Subject: Re: Correct syntax for pathological re.search()

    On Mon, 7 Oct 2024 08:35:32 -0500, Michael F. Stemper wrote:

    I'm trying to discard lines that include the string "\sout{" (which is
    TeX, for those who are curious. I have tried:
       if not re.search("\sout{", line):
       if not re.search("\sout\{", line):
       if not re.search("\\sout{", line):
       if not re.search("\\sout\{", line):

    But the lines with that string keep coming through. What is the right
    syntax to properly escape the backslash and the left curly bracket?

    $ python
    Python 3.12.6 (main, Sep  8 2024, 13:18:56) [GCC 14.2.1 20240805] on
    linux
    Type "help", "copyright", "credits" or "license" for more information.
    import re
    s = r"testing \sout{WHADDEVVA}"
    re.search(r"\\sout{", s)
    <re.Match object; span=(8, 14), match='\\sout{'>

    You want a literal backslash, hence, you need to escape everything.

    It is not enough to escape the "\s" as "\\s", because that only takes
    care
    of Python's demands for escaping "\". You also need to escape the "\" for
    the RegEx as well, or it will read it like it means "\s", which is the
    RegEx for a space character and therefore your search doesn't match,
    because it reads it like you want to search for " out{".

    Therefore, you need to escape it either as per my example, or by using
    four "\" and no "r" in front of the first quote, which also works:

    re.search("\\\\sout{", s)
    <re.Match object; span=(8, 14), match='\\sout{'>

    You don't need to escape the curly braces. We call them "seagull wings"
    where I live.



  • From Stefan Ram@21:1/5 to Gilmeh Serda on Sun Oct 13 10:45:44 2024
    Gilmeh Serda <gilmeh.serda@nothing.here.invalid> wrote or quoted:
    You don't need to escape the curly braces.

    Here's the 411 on some gnarly regex characters:

    . matches any single character, except when it hits a new line
    ^ kicks things off at the start of the sequence
    $ wraps it up at the end
    * goes zero to infinity
    + one or more times
    ? maybe once, maybe not
    { starts a specific count, like {2} or {2,3}
    } ends such a count
    | either this or that
    \ flips the script on the next character's meaning
    ( drops in on a group
    ) bails out of the group
    [ paddles out to a character class
    ] rides the character class to shore
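
    A quick check of the brace entries in that list, as a sketch only: in Python's re module, a { that does not start a well-formed count such as {2} or {2,3} is treated as an ordinary character, which is why the unescaped { in the \sout{ pattern works. The last lines use re.escape() to show that every character from the list gets a backslash in front of it.

```python
import re

# A well-formed count is a quantifier ...
assert re.fullmatch(r"a{2}", "aa")
assert re.fullmatch(r"a{2,3}", "aaa")
assert not re.fullmatch(r"a{2}", "a{2}")

# ... but a brace that cannot start a count is matched literally.
assert re.fullmatch(r"a{", "a{")
assert re.fullmatch(r"a{x}", "a{x}")

# re.escape() puts a backslash in front of every character from the list.
metacharacters = ".^$*+?{}|\\()[]"
escaped = re.escape(metacharacters)
print(escaped)
```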

    .

  • From Stefan Ram@21:1/5 to Michael F. Stemper on Mon Oct 7 13:56:51 2024
    "Michael F. Stemper" <michael.stemper@gmail.com> wrote or quoted:
    if not re.search("\\sout\{", line):

    So, if you're not down to slap an "r" before your string literals,
    you're going to end up doubling down on every backslash.

    Long story short, those double backslashes in your regex?
    They'll be quadrupling up in your Python string literal!

    main.py

    import re

    lines = r'''
    abcdef
    \sout{abcdef
    abcdef
    abc\sout{def
    abcdef
    abcdef\sout{
    abcdef
    '''.strip().split( '\n' )

    for line in lines:
        product = re.search( "\\\\sout\\{", line )
        if not product:
            print( line )

    stdout

    abcdef
    abcdef
    abcdef
    abcdef

  • From Stefan Ram@21:1/5 to Michael F. Stemper on Mon Oct 7 14:32:06 2024
    "Michael F. Stemper" <michael.stemper@gmail.com> wrote or quoted:
    For now, I'll use the "r" in a cargo-cult fashion, until I decide which syntax I prefer. (Is there any reason that one or the other is preferable?)

    I'd totally go with the r-style notation!

    It's got one bummer though - you can't end such a string literal with
    a backslash. But hey, no biggie, you could use one of those notations:

    main.py

    path = r'C:\Windows\example' + '\\'

    print( path )

    path = r'''
    C:\Windows\example\
    '''.strip()

    print( path )

    stdout

    C:\Windows\example\
    C:\Windows\example\

    .

  • From Jon Ribbens@21:1/5 to Stefan Ram on Mon Oct 7 15:43:59 2024
    On 2024-10-07, Stefan Ram <ram@zedat.fu-berlin.de> wrote:
    "Michael F. Stemper" <michael.stemper@gmail.com> wrote or quoted:
    For now, I'll use the "r" in a cargo-cult fashion, until I decide which syntax I prefer. (Is there any reason that one or the other is preferable?)

    I'd totally go with the r-style notation!

    It's got one bummer though - you can't end such a string literal with
    a backslash. But hey, no biggie, you could use one of those notations:

    main.py

    path = r'C:\Windows\example' + '\\'

    print( path )

    path = r'''
    C:\Windows\example\
    '''.strip()

    print( path )

    stdout

    C:\Windows\example\
    C:\Windows\example\

    .

    ... although of course in this example you should probably do neither of
    those things, and instead do:

    from pathlib import Path
    path = Path(r'C:\Windows\example')

    since in a Path the trailing '\' or '/' is unnecessary. Which leaves
    very few remaining uses for a raw-string with a trailing '\'...
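
    To illustrate the point about the trailing separator, here is a small sketch. It swaps in PureWindowsPath for Path so that the Windows rules apply on any operating system, and the file name notes.txt is made up for the example.

```python
from pathlib import PureWindowsPath

# PureWindowsPath applies Windows path rules on any operating system.
base = PureWindowsPath(r'C:\Windows\example')
with_trailing = PureWindowsPath('C:\\Windows\\example\\')

# The trailing separator is dropped, so both spellings compare equal.
print(base == with_trailing)

# Joining with "/" inserts the separator, so no trailing '\' is needed.
target = base / 'notes.txt'   # 'notes.txt' is a made-up file name
print(target)
```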

  • From Peter J. Holzer@21:1/5 to jak via Python-list on Mon Oct 21 21:10:49 2024
    On 2024-10-19 00:15:23 +0200, jak via Python-list wrote:
    Peter J. Holzer wrote:
    As a trivial example, the regular expressions r"\\sout{" and r"\\sout\{" are equivalent (the \ before the { is redundant). Yet
    re.compile(s).pattern preserves the difference between the two strings.

    Allow me to be fussy: r"\\sout{" and r"\\sout\{" are similar but not equivalent.

    They are. Both will match the 6 character string
    0005c \ REVERSE SOLIDUS
    00073 s LATIN SMALL LETTER S
    0006f o LATIN SMALL LETTER O
    00075 u LATIN SMALL LETTER U
    00074 t LATIN SMALL LETTER T
    0007b { LEFT CURLY BRACKET

    If you omit the backslash, the parser will have to determine if the
    brace starts a {n,m} repetition count, and that will take more time.

    Yes, that's the parser. But the result of parsing will be the same:
    The pattern will end in a literal left curly bracket.
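
    The equivalence claim is easy to spot-check. A minimal sketch, assuming a handful of made-up sample strings is enough to illustrate (it is a check, not a proof): both patterns either fail together or match the same span.

```python
import re

with_brace = re.compile(r"\\sout{")
with_escaped_brace = re.compile(r"\\sout\{")

samples = [
    r"testing \sout{WHADDEVVA}",
    r"abcdef\sout{",
    r"no match here",
    r"sout{ without a backslash",
    "\\sout",
]

for text in samples:
    a = with_brace.search(text)
    b = with_escaped_brace.search(text)
    # Either both fail, or both match the same span.
    assert (a is None) == (b is None)
    if a is not None:
        assert a.span() == b.span()

# The pattern strings themselves still differ.
print(with_brace.pattern == with_escaped_brace.pattern)
```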

    hp

    --
    _ | Peter J. Holzer | Story must make more sense than reality.
    |_|_) | |
    | | | hjp@hjp.at | -- Charles Stross, "Creative writing
    __/ | http://www.hjp.at/ | challenge!"

  • From Stefan Ram@21:1/5 to Peter J. Holzer on Mon Oct 21 20:24:49 2024
    "Peter J. Holzer" <hjp-python@hjp.at> wrote or quoted:
    On 2024-10-19 00:15:23 +0200, jak via Python-list wrote:
    Allow me to be fussy: r"\\sout{" and r"\\sout\{" are similar but not equivalent.
    . . .
    Yes, that's the parser. But the result of parsing will be the same:
    The pattern will end in a literal left curly bracket.

    Functional reqs lay out what your system's got to do, while
    non-functional reqs are all about time and other resource
    constraints.

    When you're crunching through parsing, what pops out is
    your functional bread and butter.

    But the time it takes to chew through that data?
    That's non-functional and implementation-dependent territory.

    So, we can say they're functionally equivalent.

  • From Pieter van Oostrum@21:1/5 to Stefan Ram on Tue Oct 8 19:50:14 2024
    ram@zedat.fu-berlin.de (Stefan Ram) writes:

    "Michael F. Stemper" <michael.stemper@gmail.com> wrote or quoted:

    path = r'C:\Windows\example' + '\\'

    You could even omit the '+'. Then the concatenation is done at parsing time instead of run time.
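
    A tiny sketch of that variant: two adjacent string literals, one raw and one plain, are joined into a single string without any operator.

```python
# Adjacent string literals are joined by the compiler; no '+' is needed.
path = r'C:\Windows\example' '\\'

print(path)
```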
    --
    Pieter van Oostrum <pieter@vanoostrum.org>
    www: http://pieter.vanoostrum.org/
    PGP key: [8DAE142BE17999C4]

  • From MRAB@21:1/5 to Michael F. Stemper via Python-list on Tue Oct 8 20:11:40 2024
    On 2024-10-07 14:35, Michael F. Stemper via Python-list wrote:
    I'm trying to discard lines that include the string "\sout{" (which is TeX, for
    those who are curious. I have tried:
    if not re.search("\sout{", line):
    if not re.search("\sout\{", line):
    if not re.search("\\sout{", line):
    if not re.search("\\sout\{", line):

    But the lines with that string keep coming through. What is the right syntax to
    properly escape the backslash and the left curly bracket?

    String literals use backslash as an escape character, so it needs to be escaped, or you need to use a "raw" string.

    However, regex also uses backslash as an escape character.

    That means that a literal backslash in a regex that's in a plain string
    literal needs to be doubly-escaped, once for the string literal and
    again for the regex.
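
    The two layers can be made visible with a short sketch: four backslashes in the plain literal become two in the string object, and the regex engine reads those two as one literal backslash.

```python
import re

literal = "\\\\sout{"        # what is typed in the source file
print(len(literal))          # number of characters in the string object
print(literal)               # what the regex engine receives

# The raw-string spelling produces the same string contents.
assert literal == r"\\sout{"

match = re.search(literal, r"testing \sout{WHADDEVVA}")
print(match.group())         # the text that was matched
```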

  • From MRAB@21:1/5 to Karsten Hilbert via Python-list on Tue Oct 8 20:07:04 2024
    On 2024-10-08 19:30, Karsten Hilbert via Python-list wrote:
    On Mon, Oct 07, 2024 at 08:35:32AM -0500, Michael F. Stemper via Python-list wrote:

    I'm trying to discard lines that include the string "\sout{" (which is TeX, for
    those who are curious. I have tried:
    if not re.search("\sout{", line):
    if not re.search("\sout\{", line):
    if not re.search("\\sout{", line):
    if not re.search("\\sout\{", line):

    unwanted_tex = '\sout{'
    if unwanted_tex not in line: do_something_with_libreoffice()

    That should be:

    unwanted_tex = r'\sout{'

    or:

    unwanted_tex = '\\sout{'
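
    Put together, the substring approach might look like the sketch below. The sample lines are borrowed from the test data posted earlier in the thread; since no regular expression is involved, only the string-literal level of escaping matters.

```python
unwanted_tex = r'\sout{'

# Sample lines borrowed from the test data posted earlier in the thread.
lines = [
    r'abcdef',
    r'\sout{abcdef',
    r'abc\sout{def',
    r'abcdef\sout{',
]

# Substring test: no regular expression, so only one level of escaping.
kept = [line for line in lines if unwanted_tex not in line]
print(kept)
```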

  • From Karsten Hilbert@21:1/5 to All on Tue Oct 8 20:30:34 2024
    On Mon, Oct 07, 2024 at 08:35:32AM -0500, Michael F. Stemper via Python-list wrote:

    I'm trying to discard lines that include the string "\sout{" (which is TeX, for
    those who are curious. I have tried:
    if not re.search("\sout{", line):
    if not re.search("\sout\{", line):
    if not re.search("\\sout{", line):
    if not re.search("\\sout\{", line):

    unwanted_tex = '\sout{'
    if unwanted_tex not in line: do_something_with_libreoffice()

    Karsten
    --
    GPG 40BE 5B0E C98E 1713 AFA6 5BC0 3BEA AC80 7D4F C89B

  • From Stefan Ram@21:1/5 to Stefan Ram on Tue Oct 8 19:57:45 2024
    ram@zedat.fu-berlin.de (Stefan Ram) wrote or quoted:
    "\\\\chardef \\\\\\\\ = '\\\\\\\\".

    However, one can rewrite this as follows:

    "`chardef `` = '``".replace( "`", "\\"*2 )

    . One can also use "repr" to find how to represent something:

    main.py

    text = input( "What do you want me to represent as a literal? " )
    print( repr( text ))

    transcript

    What do you want me to represent as a literal? \\sout\{
    '\\\\sout\\{'

    . We can use "escape" and "repr" to find how to represent
    a regular expression for a literal text:

    main.py

    import re

    text = input( "Want the literal of an re for what text? " )
    print( repr( re.escape( text )))

    transcript

    Want the literal of an re for what text? \sout{
    '\\\\sout\\{'

    .

  • From Stefan Ram@21:1/5 to MRAB on Tue Oct 8 19:32:04 2024
    MRAB <python@mrabarnett.plus.com> wrote or quoted:
    However, regex also uses backslash as an escape character.

    TeX also uses the backslash as an escape character:

    \chardef \\ = '\\

    , the regular expression to search exactly this:

    \\chardef \\\\ = '\\\\

    , and the Python string literal for that regular expression:

    "\\\\chardef \\\\\\\\ = '\\\\\\\\".

    . Must be a reason Markdown started to use the backtick!

  • From Alan Bawden@21:1/5 to All on Tue Oct 8 16:59:48 2024
    Karsten Hilbert <Karsten.Hilbert@gmx.net> writes:

    Python 3.11.2 (main, Aug 26 2024, 07:20:54) [GCC 12.2.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> tex = '\sout{'
    >>> tex
    '\\sout{'
    >>>

    Am I missing something ?

    You're missing the warning it generates:

    > python -E -Wonce
    Python 3.11.2 (main, Aug 26 2024, 07:20:54) [GCC 12.2.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> tex = '\sout{'
    <stdin>:1: DeprecationWarning: invalid escape sequence '\s'
    >>>

  • From Karsten Hilbert@21:1/5 to All on Tue Oct 8 22:17:49 2024
    On Tue, Oct 08, 2024 at 08:07:04PM +0100, MRAB via Python-list wrote:

    unwanted_tex = '\sout{'
    if unwanted_tex not in line: do_something_with_libreoffice()

    That should be:

    unwanted_tex = r'\sout{'

    Hm.

    Python 3.11.2 (main, Aug 26 2024, 07:20:54) [GCC 12.2.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> tex = '\sout{'
    >>> tex
    '\\sout{'


    Am I missing something ?

    Karsten
    --
    GPG 40BE 5B0E C98E 1713 AFA6 5BC0 3BEA AC80 7D4F C89B

  • From MRAB@21:1/5 to Alan Bawden via Python-list on Tue Oct 8 23:10:03 2024
    On 2024-10-08 21:59, Alan Bawden via Python-list wrote:
    Karsten Hilbert <Karsten.Hilbert@gmx.net> writes:

    Python 3.11.2 (main, Aug 26 2024, 07:20:54) [GCC 12.2.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> tex = '\sout{'
    >>> tex
    '\\sout{'
    >>>

    Am I missing something ?

    You're missing the warning it generates:

    > python -E -Wonce
    Python 3.11.2 (main, Aug 26 2024, 07:20:54) [GCC 12.2.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> tex = '\sout{'
    <stdin>:1: DeprecationWarning: invalid escape sequence '\s'
    >>>

    You got lucky that \s is invalid. If it had been \t you would've got a
    tab character.

    Historically, Python treated invalid escape sequences as literals, but
    it's deprecated now and will become an outright error in the future
    (probably) because it often hides a mistake, such as the aforementioned
    \t being treated as a tab character when the user expected it to be a
    literal backslash followed by letter t. (This can occur within Windows
    file paths written in plain string literals.)
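
    The warning can also be observed from a script rather than the REPL. A sketch, assuming the invalid escape is still only a warning in the Python being used: the transcript above shows DeprecationWarning on 3.11, and Python 3.12 reports the same problem as SyntaxWarning. The file name "<example>" is just a placeholder.

```python
import warnings

source = r"'\sout{'"   # source text of a plain (non-raw) string literal

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # compile() runs the same parser that reads a .py file.
    # Assumption: the invalid escape is still only a warning; a future
    # Python may raise SyntaxError here instead.
    code = compile(source, "<example>", "eval")

for w in caught:
    print(w.category.__name__, w.message)

value = eval(code)
print(repr(value))              # backslash kept, because \s is not an escape
print(repr(eval(r"'\tout{'")))  # \t is valid, so this starts with a tab
```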

  • From Karsten Hilbert@21:1/5 to All on Wed Oct 9 20:06:10 2024
    On Tue, Oct 08, 2024 at 04:59:48PM -0400, Alan Bawden via Python-list wrote:

    Karsten Hilbert <Karsten.Hilbert@gmx.net> writes:

    Python 3.11.2 (main, Aug 26 2024, 07:20:54) [GCC 12.2.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> tex = '\sout{'
    >>> tex
    '\\sout{'
    >>>

    Am I missing something ?

    You're missing the warning it generates:

    <stdin>:1: DeprecationWarning: invalid escape sequence '\s'

    I knew it'd be good to ask :-D

    Karsten
    --
    GPG 40BE 5B0E C98E 1713 AFA6 5BC0 3BEA AC80 7D4F C89B

  • From Gilmeh Serda@21:1/5 to Michael F. Stemper on Fri Oct 11 14:43:56 2024
    On Mon, 7 Oct 2024 08:35:32 -0500, Michael F. Stemper wrote:

    I'm trying to discard lines that include the string "\sout{" (which is
    TeX, for those who are curious. I have tried:
    if not re.search("\sout{", line):
    if not re.search("\sout\{", line):
    if not re.search("\\sout{", line):
    if not re.search("\\sout\{", line):

    But the lines with that string keep coming through. What is the right
    syntax to properly escape the backslash and the left curly bracket?

    $ python
    Python 3.12.6 (main, Sep 8 2024, 13:18:56) [GCC 14.2.1 20240805] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import re
    >>> s = r"testing \sout{WHADDEVVA}"
    >>> re.search(r"\\sout{", s)
    <re.Match object; span=(8, 14), match='\\sout{'>

    You want a literal backslash, hence, you need to escape everything.

    It is not enough to escape the "\s" as "\\s", because that only takes care
    of Python's demands for escaping "\". You also need to escape the "\" for
    the RegEx as well, or it will read it like it means "\s", which is the
    RegEx for a space character and therefore your search doesn't match,
    because it reads it like you want to search for " out{".

    Therefore, you need to escape it either as per my example, or by using
    four "\" and no "r" in front of the first quote, which also works:

    >>> re.search("\\\\sout{", s)
    <re.Match object; span=(8, 14), match='\\sout{'>

    You don't need to escape the curly braces. We call them "seagull wings"
    where I live.

    --
    Gilmeh

    Sometimes I simply feel that the whole world is a cigarette and I'm the
    only ashtray.

  • From MRAB@21:1/5 to AVI GROSS via Python-list on Sat Oct 12 01:37:55 2024
    On 2024-10-11 22:13, AVI GROSS via Python-list wrote:
    Is there some utility function out there that can be called to show what the regular expression you typed in will look like by the time it is ready to be used?

    Obviously, life is not that simple as it can go through multiple layers with each dealing with a layer of backslashes.

    But for simple cases, ...

    Yes. It's called 'print'. :-)


    -----Original Message-----
    From: Python-list <python-list-bounces+avi.e.gross=gmail.com@python.org> On Behalf Of Gilmeh Serda via Python-list
    Sent: Friday, October 11, 2024 10:44 AM
    To: python-list@python.org
    Subject: Re: Correct syntax for pathological re.search()

    On Mon, 7 Oct 2024 08:35:32 -0500, Michael F. Stemper wrote:

    I'm trying to discard lines that include the string "\sout{" (which is
    TeX, for those who are curious. I have tried:
    if not re.search("\sout{", line):
    if not re.search("\sout\{", line):
    if not re.search("\\sout{", line):
    if not re.search("\\sout\{", line):

    But the lines with that string keep coming through. What is the right
    syntax to properly escape the backslash and the left curly bracket?

    $ python
    Python 3.12.6 (main, Sep 8 2024, 13:18:56) [GCC 14.2.1 20240805] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    import re
    s = r"testing \sout{WHADDEVVA}"
    re.search(r"\\sout{", s)
    <re.Match object; span=(8, 14), match='\\sout{'>

    You want a literal backslash, hence, you need to escape everything.

    It is not enough to escape the "\s" as "\\s", because that only takes care
    of Python's demands for escaping "\". You also need to escape the "\" for
    the RegEx as well, or it will read it like it means "\s", which is the
    RegEx for a space character and therefore your search doesn't match,
    because it reads it like you want to search for " out{".

    Therefore, you need to escape it either as per my example, or by using
    four "\" and no "r" in front of the first quote, which also works:

    re.search("\\\\sout{", s)
    <re.Match object; span=(8, 14), match='\\sout{'>

    You don't need to escape the curly braces. We call them "seagull wings"
    where I live.


  • From Peter J. Holzer@21:1/5 to Thomas Passin via Python-list on Fri Oct 18 23:09:41 2024
    On 2024-10-12 08:51:57 -0400, Thomas Passin via Python-list wrote:
    On 10/12/2024 6:59 AM, Peter J. Holzer via Python-list wrote:
    On 2024-10-11 17:13:07 -0400, AVI GROSS via Python-list wrote:
    Is there some utility function out there that can be called to show what the
    regular expression you typed in will look like by the time it is ready to be
    used?

    I assume that by "ready to be used" you mean the compiled form?

    No, there doesn't seem to be a way to dump that. You can

    p = re.compile("\\\\sout{")
    print(p.pattern)

    but that just prints the input string, which you could do without
    compiling it first.

    It prints the escaped version,

    Did you mean the *un*escaped version? Well, yeah, that's what print
    does.

    so you can see if you escaped the string as you intended. In this
    case, the print will display '\\sout{'.

    print("\\\\sout{")
    will do the same.

    It seems to me that for any string s which is a valid regular expression
    (i.e. re.compile doesn't throw an exception)

    assert re.compile(s).pattern == s

    holds.

    So it doesn't give you anything you didn't already know.

    As a trivial example, the regular expressions r"\\sout{" and r"\\sout\{"
    are equivalent (the \ before the { is redundant). Yet
    re.compile(s).pattern preserves the difference between the two strings.
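
    A short sketch of that observation, using a few patterns that appear elsewhere in this thread: the pattern attribute hands back exactly the string that went in, so two equivalent spellings remain two different strings.

```python
import re

candidates = [r"\\sout{", r"\\sout\{", r"\w+\\sub", r"a{2,3}", r"\**"]

for s in candidates:
    # The compiled object keeps the exact string it was given.
    assert re.compile(s).pattern == s

# Equivalent regexes with different spellings stay different strings.
first, second = (re.compile(s).pattern for s in candidates[:2])
print(first)
print(second)
```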

    hp

    --
    _ | Peter J. Holzer | Story must make more sense than reality.
    |_|_) | |
    | | | hjp@hjp.at | -- Charles Stross, "Creative writing
    __/ | http://www.hjp.at/ | challenge!"

  • From jak@21:1/5 to All on Sat Oct 19 00:15:23 2024
    Peter J. Holzer wrote:
    As a trivial example, the regular expressions r"\\sout{" and r"\\sout\{"
    are equivalent (the \ before the { is redundant). Yet
    re.compile(s).pattern preserves the difference between the two strings.


    Hi,
    Allow me to be fussy: r"\\sout{" and r"\\sout\{" are similar but not equivalent. If you omit the backslash, the parser has to determine
    whether the brace starts a {n,m} repetition count, which takes more
    time. Some online regex testers give these results:

    r"\\sout{" : 1 match ( 7 steps, 620 μs )

    r"\\sout\{" : 1 match ( 7 steps, 360 μs )
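
    Those figures come from an online tester, which may well use a different regex engine. To see what Python's own re module does, one could time the compilation step locally. This is only a sketch: the helper name compile_uncached is made up, the numbers depend on machine and Python version, and each measurement also includes the cost of clearing the cache.

```python
import re
import timeit


def compile_uncached(pattern):
    # re.compile() caches compiled patterns, so clear the cache first;
    # otherwise repeated calls would only time a cache lookup.
    re.purge()
    return re.compile(pattern)


timings = {}
for pattern in (r"\\sout{", r"\\sout\{"):
    # Best of three runs of 500 compilations each, in seconds.
    timings[pattern] = min(
        timeit.repeat(lambda: compile_uncached(pattern), number=500, repeat=3)
    )

for pattern, seconds in timings.items():
    print(f"{pattern!r}: {seconds / 500 * 1e6:.2f} microseconds per compile")
```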
