Forum: Too Lazy BBS


Re: Correct syntax for pathological re.search()

From Peter J. Holzer@21:1/5 to AVI GROSS via Python-list on Sat Oct 12 12:59:58 2024

On 2024-10-11 17:13:07 -0400, AVI GROSS via Python-list wrote:

Is there some utility function out there that can be called to show what the regular expression you typed in will look like by the time it is ready to be used?

I assume that by "ready to be used" you mean the compiled form?

No, there doesn't seem to be a way to dump that. You can

p = re.compile("\\\\sout{")
print(p.pattern)

but that just prints the input string, which you could do without
compiling it first.

But - without having looked at the implementation - it's far from clear
that the compiled form would be useful to the user. It's probably some
kind of state machine, and a large table of state transitions isn't very readable.
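For what it's worth, the re module documents a re.DEBUG flag that makes re.compile() print debug information about the expression. A minimal sketch, assuming CPython; the exact dump format is an implementation detail and may differ between versions, and the variable names here are only illustrative:

```python
import contextlib
import io
import re

# re.DEBUG makes re.compile() print the parsed pattern to stdout.
# Capture it so it can be inspected (or just let it print).
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    re.compile("\\\\sout{", re.DEBUG)

debug_dump = buffer.getvalue()
print(debug_dump)  # one LITERAL entry per character, by code point
```

The dump lists the characters the regex engine will look for, which shows whether the backslashes survived both layers of escaping as intended.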

There are a number of websites which visualize regular expressions.
Those are probably better for debugging a regular expression than
anything the re module could reasonably produce (although with the
caveat that such a web site would use a different implementation and
therefore might produce different results).

hp

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"


From Stefan Ram@21:1/5 to Peter J. Holzer on Sat Oct 12 11:59:52 2024

"Peter J. Holzer" <hjp-python@hjp.at> wrote or quoted:

But - without having looked at the implementation - it's far from clear
that the compiled form would be useful to the user.

So, what he might be getting at with "compiled form" is a
representation that's easy on the eyes for us mere mortals.

You could, for instance, use colors to show the difference between
object and meta characters. In that case, the regex "\**" would
come out as "**", but the first "*" might be navy blue (on a white
background), so just your run-of-the-mill object character, while
the second one would be burgundy, flagging it as a meta character.

So, simplified, that would be something like:

import re
import tkinter as tk
import time

def tokenize_regex( pattern ):
    tokens = []
    i = 0
    while i < len( pattern ):
        if pattern[ i ] == '\\':
            if i + 1 < len( pattern ):
                tokens.append( ( 'escaped', pattern[ i+1: i+2 ]))
                i += 2
            else:
                tokens.append( ('error', 'Incomplete escape sequence' ))
                i += 1
        elif pattern[i] == '*':
            tokens.append( ( 'repetition', '*' ))
            i += 1
        else:
            tokens.append( ( 'plain', pattern[ i ]))
            i += 1

    return tokens

root = tk.Tk()
root.configure( bg='white' )

regex = r'\**'
result = tokenize_regex( regex )

for token_type, token_value in result:
    if token_type == 'plain' or token_type == 'escaped':
        tk.Label( root, text=token_value, font=( 'Arial', 40 ), fg='#4070FF', bg='white' ).pack( side='left' )
    elif token_type == 'repetition':
        tk.Label( root, text=token_value, font=( 'Arial', 40 ), fg='#C02000', bg='white' ).pack( side='left' )

root.mainloop()

.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Passin@21:1/5 to Peter J. Holzer via Python-list on Sat Oct 12 08:51:57 2024

On 10/12/2024 6:59 AM, Peter J. Holzer via Python-list wrote:

On 2024-10-11 17:13:07 -0400, AVI GROSS via Python-list wrote:

Is there some utility function out there that can be called to show what the regular expression you typed in will look like by the time it is ready to be used?

I assume that by "ready to be used" you mean the compiled form?

No, there doesn't seem to be a way to dump that. You can

p = re.compile("\\\\sout{")
print(p.pattern)

but that just prints the input string, which you could do without
compiling it first.

It prints the escaped version, so you can see if you escaped the string
as you intended. In this case, the print will display '\\sout{'. That's
worth something.

But - without having looked at the implementation - it's far from clear
that the compiled form would be useful to the user. It's probably some
kind of state machine, and a large table of state transitions isn't very readable.

There are a number of websites which visualize regular expressions.
Those are probably better for debugging a regular expression than
anything the re module could reasonably produce (although with the
caveat that such a web site would use a different implementation and therefore might produce different results).

hp


From Thomas Passin@21:1/5 to MRAB via Python-list on Sat Oct 12 09:06:54 2024

On 10/11/2024 8:37 PM, MRAB via Python-list wrote:

On 2024-10-11 22:13, AVI GROSS via Python-list wrote:

Is there some utility function out there that can be called to show what the
regular expression you typed in will look like by the time it is ready to be
used?

Obviously, life is not that simple as it can go through multiple layers with
each dealing with a layer of backslashes.

But for simple cases, ...

Yes. It's called 'print'. :-)

There is a section in the Python docs about this backslash subject. It's
titled "The Backslash Plague" in

https://docs.python.org/3/howto/regex.html

You can also inspect the compiled expression to see what string it
received after all the escaping:

import re

re_string = '\\w+\\\\sub'
re_pattern = re.compile(re_string)

# Should look as if we had used r'\w+\\sub'
print(re_pattern.pattern)

\w+\\sub



From Stefan Ram@21:1/5 to Gilmeh Serda on Sun Oct 13 10:45:44 2024

Gilmeh Serda <gilmeh.serda@nothing.here.invalid> wrote or quoted:

You don't need to escape the curly braces.

Here's the 411 on some gnarly regex characters:

. matches any single character, except when it hits a new line
^ kicks things off at the start of the sequence
$ wraps it up at the end
* goes zero to infinity
+ one or more times
? maybe once, maybe not
{ starts a specific count, like {2} or {2,3}
} ends such a count
| either this or that
\ flips the script on the next character's meaning
( drops in on a group
) bails out of the group
[ paddles out to a character class
] rides the character class to shore

.
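A quick way to check the point about curly braces against the list above is to try both uses of "{" and then let re.escape() do the escaping. This is only a sketch; the sample strings and variable names are made up for illustration:

```python
import re

# "{" only acts as a metacharacter when it opens a valid count
# such as {2} or {2,3}.
assert re.fullmatch(r"ab{2}", "abb")           # quantifier: exactly two "b"
assert re.search(r"\\sout{", r"\sout{text}")   # lone "{" is taken literally

# re.escape() puts a backslash in front of every character that
# could be special, so the result matches the text literally.
specials = ".^$*+?{}|\\()[]"
escaped = re.escape(specials)
assert re.fullmatch(escaped, specials)
print(escaped)
```

Escaping a brace that does not need it is harmless, which is why re.escape() can afford to escape all of them.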


From Stefan Ram@21:1/5 to Michael F. Stemper on Mon Oct 7 13:56:51 2024

"Michael F. Stemper" <michael.stemper@gmail.com> wrote or quoted:

if not re.search("\\sout\{", line):

So, if you're not down to slap an "r" before your string literals,
you're going to end up doubling down on every backslash.

Long story short, those double backslashes in your regex?
They'll be quadrupling up in your Python string literal!

main.py

import re

lines = r'''
abcdef
\sout{abcdef
abcdef
abc\sout{def
abcdef
abcdef\sout{
abcdef
'''.strip().split( '\n' )

for line in lines:
    product = re.search( "\\\\sout\\{", line )
    if not product:
        print( line )

stdout

abcdef
abcdef
abcdef
abcdef


From Stefan Ram@21:1/5 to Michael F. Stemper on Mon Oct 7 14:32:06 2024

"Michael F. Stemper" <michael.stemper@gmail.com> wrote or quoted:

For now, I'll use the "r" in a cargo-cult fashion, until I decide which syntax I prefer. (Is there any reason that one or the other is preferable?)

I'd totally go with the r-style notation!

It's got one bummer though - you can't end such a string literal with
a backslash. But hey, no biggie, you could use one of those notations:

main.py

path = r'C:\Windows\example' + '\\'

print( path )

path = r'''
C:\Windows\example\
'''.strip()

print( path )

stdout

C:\Windows\example\
C:\Windows\example\

.


From Jon Ribbens@21:1/5 to Stefan Ram on Mon Oct 7 15:43:59 2024

On 2024-10-07, Stefan Ram <ram@zedat.fu-berlin.de> wrote:

"Michael F. Stemper" <michael.stemper@gmail.com> wrote or quoted:

For now, I'll use the "r" in a cargo-cult fashion, until I decide which syntax I prefer. (Is there any reason that one or the other is preferable?)

I'd totally go with the r-style notation!

It's got one bummer though - you can't end such a string literal with
a backslash. But hey, no biggie, you could use one of those notations:

main.py

path = r'C:\Windows\example' + '\\'

print( path )

path = r'''
C:\Windows\example\
'''.strip()

print( path )

stdout

C:\Windows\example\
C:\Windows\example\

.

... although of course in this example you should probably do neither of
those things, and instead do:

from pathlib import Path
path = Path(r'C:\Windows\example')

since in a Path the trailing '\' or '/' is unnecessary. Which leaves
very few remaining uses for a raw-string with a trailing '\'...
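A small sketch of why the trailing separator stops mattering with pathlib. It uses PureWindowsPath so that it behaves the same on any operating system; the file name notes.txt is purely hypothetical:

```python
from pathlib import PureWindowsPath

base = PureWindowsPath(r'C:\Windows\example')
with_trailing = PureWindowsPath('C:\\Windows\\example\\')

# A trailing separator is dropped when the path is parsed.
assert base == with_trailing

# Components are joined with "/", so no trailing backslash is needed.
child = base / 'notes.txt'
print(child)
```

On Windows itself, Path would give the same results; PureWindowsPath is only used here to keep the example portable.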


From Peter J. Holzer@21:1/5 to jak via Python-list on Mon Oct 21 21:10:49 2024

On 2024-10-19 00:15:23 +0200, jak via Python-list wrote:

Peter J. Holzer ha scritto:

As a trivial example, the regular expressions r"\\sout{" and r"\\sout\{" are equivalent (the \ before the { is redundant). Yet
re.compile(s).pattern preserves the difference between the two strings.

Allow me to be fussy: r"\\sout{" and r"\\sout\{" are similar but not equivalent.

They are. Both will match the 6 character string
0005c \ REVERSE SOLIDUS
00073 s LATIN SMALL LETTER S
0006f o LATIN SMALL LETTER O
00075 u LATIN SMALL LETTER U
00074 t LATIN SMALL LETTER T
0007b { LEFT CURLY BRACKET

If you omit the backslash, the parser will have to determine whether the
brace starts a repetition of the form {n, m}, and that takes more time.

Yes, that's the parser. But the result of parsing will be the same:
The pattern will end in a literal left curly bracket.

hp

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"


From Stefan Ram@21:1/5 to Peter J. Holzer on Mon Oct 21 20:24:49 2024

"Peter J. Holzer" <hjp-python@hjp.at> wrote or quoted:

On 2024-10-19 00:15:23 +0200, jak via Python-list wrote:

Allow me to be fussy: r"\\sout{" and r"\\sout\{" are similar but not equivalent.

. . .

Yes, that's the parser. But the result of parsing will be the same:
The pattern will end in a literal left curly bracket.

Functional reqs lay out what your system's got to do, while
non-functional reqs are all about time and other resource
constraints.

When you're crunching through parsing, what pops out is
your functional bread and butter.

But the time it takes to chew through that data?
That's non-functional and implementation-dependent territory.

So, we can say they're functionally equivalent.


From Pieter van Oostrum@21:1/5 to Stefan Ram on Tue Oct 8 19:50:14 2024

ram@zedat.fu-berlin.de (Stefan Ram) writes:

"Michael F. Stemper" <michael.stemper@gmail.com> wrote or quoted:

path = r'C:\Windows\example' + '\\'

You could even omit the '+'. Then the concatenation is done at parsing time instead of run time.
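To illustrate: Python joins adjacent string literals into one, and a raw literal can sit next to a plain one. A minimal sketch with illustrative variable names:

```python
# Explicit concatenation with "+".
path_plus = r'C:\Windows\example' + '\\'

# Adjacent literals: the raw part and the plain part are joined
# into a single string constant, no operator needed.
path_adjacent = r'C:\Windows\example' '\\'

assert path_plus == path_adjacent
print(path_adjacent)
```

Both spellings give the same 19-character string ending in one backslash.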
--
Pieter van Oostrum <pieter@vanoostrum.org>
www: http://pieter.vanoostrum.org/
PGP key: [8DAE142BE17999C4]


From MRAB@21:1/5 to Michael F. Stemper via Python-list on Tue Oct 8 20:11:40 2024

On 2024-10-07 14:35, Michael F. Stemper via Python-list wrote:

I'm trying to discard lines that include the string "\sout{" (which is TeX, for
those who are curious. I have tried:
if not re.search("\sout{", line):
if not re.search("\sout\{", line):
if not re.search("\\sout{", line):
if not re.search("\\sout\{", line):

But the lines with that string keep coming through. What is the right syntax to
properly escape the backslash and the left curly bracket?

String literals use backslash as an escape character, so it needs to be escaped, or you need to use a "raw" string.

However, regex also uses backslash as an escape character.

That means that a literal backslash in a regex that's in a plain string
literal needs to be doubly-escaped, once for the string literal and
again for the regex.
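The two layers can be checked one at a time. A small sketch, with made-up sample lines and variable names, showing what each spelling actually hands to the regex engine:

```python
import re

line = r"abc\sout{def"           # the text contains ONE backslash

plain_literal = "\\\\sout{"      # 4 backslashes in source, 2 in the string
raw_literal = r"\\sout{"         # 2 backslashes in source, 2 in the string
assert plain_literal == raw_literal
assert len(plain_literal) == 7   # \ \ s o u t {

# The regex sees two backslashes and reads them as one literal backslash.
assert re.search(plain_literal, line)

# Escaping only once leaves the regex with \s, i.e. "any whitespace".
under_escaped = "\\sout{"
assert re.search(under_escaped, line) is None
assert re.search(under_escaped, "abc out{def")
```

The last two assertions show why the under-escaped version lets the TeX lines through: it is looking for whitespace followed by "out{".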


From MRAB@21:1/5 to Karsten Hilbert via Python-list on Tue Oct 8 20:07:04 2024

On 2024-10-08 19:30, Karsten Hilbert via Python-list wrote:

Am Mon, Oct 07, 2024 at 08:35:32AM -0500 schrieb Michael F. Stemper via Python-list:

I'm trying to discard lines that include the string "\sout{" (which is TeX, for
those who are curious. I have tried:
if not re.search("\sout{", line):
if not re.search("\sout\{", line):
if not re.search("\\sout{", line):
if not re.search("\\sout\{", line):

unwanted_tex = '\sout{'
if unwanted_tex not in line: do_something_with_libreoffice()

That should be:

unwanted_tex = r'\sout{'

or:

unwanted_tex = '\\sout{'
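Put together, the regex-free filter might look like the sketch below. The sample lines are invented for illustration; only the raw-string spelling of the search text comes from the thread:

```python
unwanted_tex = r'\sout{'

# Hypothetical input lines, one of which contains the TeX command.
lines = [r'keep this', r'drop \sout{this}', r'keep that']

# A plain substring test involves only one layer of escaping
# (the string literal), because no regex is used.
kept = [line for line in lines if unwanted_tex not in line]
print(kept)
```

Since "in" compares characters directly, neither the backslash nor the brace has any special meaning here.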


From Karsten Hilbert@21:1/5 to All on Tue Oct 8 20:30:34 2024

Am Mon, Oct 07, 2024 at 08:35:32AM -0500 schrieb Michael F. Stemper via Python-list:

I'm trying to discard lines that include the string "\sout{" (which is TeX, for
those who are curious. I have tried:
if not re.search("\sout{", line):
if not re.search("\sout\{", line):
if not re.search("\\sout{", line):
if not re.search("\\sout\{", line):

unwanted_tex = '\sout{'
if unwanted_tex not in line: do_something_with_libreoffice()

Karsten
--
GPG 40BE 5B0E C98E 1713 AFA6 5BC0 3BEA AC80 7D4F C89B


From Stefan Ram@21:1/5 to Stefan Ram on Tue Oct 8 19:57:45 2024

ram@zedat.fu-berlin.de (Stefan Ram) wrote or quoted:

"\\\\chardef \\\\\\\\ = '\\\\\\\\".

However, one can rewrite this as follows:

"`chardef `` = '``".replace( "`", "\\"*2 )

. One can also use "repr" to find how to represent something:

main.py

text = input( "What do you want me to represent as a literal? " )
print( repr( text ))

transcript

What do you want me to represent as a literal? \\sout\{
'\\\\sout\\{'

. We can use "escape" and "repr" to find how to represent
a regular expression for a literal text:

main.py

import re

text = input( "Want the literal of an re for what text? " )
print( repr( re.escape( text )))

transcript

Want the literal of an re for what text? \sout{
'\\\\sout\\{'

.


From Stefan Ram@21:1/5 to MRAB on Tue Oct 8 19:32:04 2024

MRAB <python@mrabarnett.plus.com> wrote or quoted:

However, regex also uses backslash as an escape character.

TeX also uses the backslash as an escape character:

\chardef \\ = '\\

, the regular expression to search exactly this:

\\chardef \\\\ = '\\\\

, and the Python string literal for that regular expression:

"\\\\chardef \\\\\\\\ = '\\\\\\\\".

. Must be a reason Markdown started to use the backtick!
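The three layers above can be verified mechanically. A sketch, with illustrative variable names, that checks each layer against the next:

```python
import re

tex_source = r"\chardef \\ = '\\"             # the TeX text itself
regex = r"\\chardef \\\\ = '\\\\"             # regex matching that text
plain_literal = "\\\\chardef \\\\\\\\ = '\\\\\\\\"  # same regex, no r prefix

# The raw and the plain spelling produce the same regex string.
assert regex == plain_literal

# That regex matches the TeX text exactly.
assert re.fullmatch(regex, tex_source)

# re.escape() builds an equivalent regex without counting by hand
# (it may escape more characters than strictly necessary).
assert re.fullmatch(re.escape(tex_source), tex_source)
```

Each layer doubles the backslashes of the one before it, which is where the count of eight comes from.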


From Alan Bawden@21:1/5 to All on Tue Oct 8 16:59:48 2024

Karsten Hilbert <Karsten.Hilbert@gmx.net> writes:

Python 3.11.2 (main, Aug 26 2024, 07:20:54) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> tex = '\sout{'
>>> tex
'\\sout{'
>>>

Am I missing something ?

You're missing the warning it generates:

> python -E -Wonce
Python 3.11.2 (main, Aug 26 2024, 07:20:54) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> tex = '\sout{'
<stdin>:1: DeprecationWarning: invalid escape sequence '\s'
>>>


From Karsten Hilbert@21:1/5 to All on Tue Oct 8 22:17:49 2024

Am Tue, Oct 08, 2024 at 08:07:04PM +0100 schrieb MRAB via Python-list:

unwanted_tex = '\sout{'
if unwanted_tex not in line: do_something_with_libreoffice()

That should be:

unwanted_tex = r'\sout{'

Hm.

Python 3.11.2 (main, Aug 26 2024, 07:20:54) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> tex = '\sout{'
>>> tex
'\\sout{'

Am I missing something ?

Karsten
--
GPG 40BE 5B0E C98E 1713 AFA6 5BC0 3BEA AC80 7D4F C89B


From MRAB@21:1/5 to Alan Bawden via Python-list on Tue Oct 8 23:10:03 2024

On 2024-10-08 21:59, Alan Bawden via Python-list wrote:

Karsten Hilbert <Karsten.Hilbert@gmx.net> writes:

Python 3.11.2 (main, Aug 26 2024, 07:20:54) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> tex = '\sout{'
>>> tex
'\\sout{'
>>>

Am I missing something ?

You're missing the warning it generates:

> python -E -Wonce
Python 3.11.2 (main, Aug 26 2024, 07:20:54) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> tex = '\sout{'
<stdin>:1: DeprecationWarning: invalid escape sequence '\s'
>>>

You got lucky that \s is invalid. If it had been \t you would've got a
tab character.

Historically, Python treated invalid escape sequences as literals, but
it's deprecated now and will become an outright error in the future
(probably) because it often hides a mistake, such as the aforementioned
\t being treated as a tab character when the user expected it to be a
literal backslash followed by letter t. (This can occur within Windows
file paths written in plain string literals.)
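Both effects can be seen in a few lines. This is a sketch under some assumptions: the path C:\temp is just an example, and the warning category depends on the Python version (DeprecationWarning in older releases, SyntaxWarning in newer ones), so the code accepts either and also allows for a future SyntaxError:

```python
import warnings

plain = 'C:\temp'    # \t is a valid escape: this string holds a TAB
raw = r'C:\temp'     # backslash followed by the letter t
assert len(plain) == 6 and '\t' in plain
assert len(raw) == 7 and '\\' in raw

# Source text of a literal containing the invalid escape \s.
source = r"'\sout{'"
caught = []
try:
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        value = eval(compile(source, "<demo>", "eval"))
except SyntaxError:
    value = None     # a future Python may reject the literal outright

for w in caught:
    print(w.category.__name__, w.message)
```

Compiling the literal from a string, rather than writing it directly, keeps the warning out of the example's own source file.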


From Karsten Hilbert@21:1/5 to All on Wed Oct 9 20:06:10 2024

Am Tue, Oct 08, 2024 at 04:59:48PM -0400 schrieb Alan Bawden via Python-list:

Karsten Hilbert <Karsten.Hilbert@gmx.net> writes:

Python 3.11.2 (main, Aug 26 2024, 07:20:54) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> tex = '\sout{'
>>> tex
'\\sout{'
>>>

Am I missing something ?

You're missing the warning it generates:

<stdin>:1: DeprecationWarning: invalid escape sequence '\s'

I knew it'd be good to ask :-D

Karsten
--
GPG 40BE 5B0E C98E 1713 AFA6 5BC0 3BEA AC80 7D4F C89B


From Gilmeh Serda@21:1/5 to Michael F. Stemper on Fri Oct 11 14:43:56 2024

On Mon, 7 Oct 2024 08:35:32 -0500, Michael F. Stemper wrote:

I'm trying to discard lines that include the string "\sout{" (which is
TeX, for those who are curious). I have tried:
if not re.search("\sout{", line):
if not re.search("\sout\{", line):
if not re.search("\\sout{", line):
if not re.search("\\sout\{", line):

But the lines with that string keep coming through. What is the right
syntax to properly escape the backslash and the left curly bracket?

$ python
Python 3.12.6 (main, Sep 8 2024, 13:18:56) [GCC 14.2.1 20240805] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import re
>>> s = r"testing \sout{WHADDEVVA}"
>>> re.search(r"\\sout{", s)
<re.Match object; span=(8, 14), match='\\sout{'>

You want a literal backslash, hence, you need to escape everything.

It is not enough to escape the "\s" as "\\s", because that only takes care
of Python's demands for escaping "\". You also need to escape the "\" for
the RegEx as well, or it will read it like it means "\s", which is the
RegEx for a whitespace character and therefore your search doesn't match,
because it reads it like you want to search for " out{".

Therefore, you need to escape it either as per my example, or by using
four "\" and no "r" in front of the first quote, which also works:

re.search("\\\\sout{", s)

<re.Match object; span=(8, 14), match='\\sout{'>

You don't need to escape the curly braces. We call them "seagull wings"
where I live.

--
Gilmeh

Sometimes I simply feel that the whole world is a cigarette and I'm the
only ashtray.


From MRAB@21:1/5 to AVI GROSS via Python-list on Sat Oct 12 01:37:55 2024

On 2024-10-11 22:13, AVI GROSS via Python-list wrote:

Is there some utility function out there that can be called to show what the regular expression you typed in will look like by the time it is ready to be used?

Obviously, life is not that simple as it can go through multiple layers with each dealing with a layer of backslashes.

But for simple cases, ...

Yes. It's called 'print'. :-)



From Peter J. Holzer@21:1/5 to Thomas Passin via Python-list on Fri Oct 18 23:09:41 2024

On 2024-10-12 08:51:57 -0400, Thomas Passin via Python-list wrote:

On 10/12/2024 6:59 AM, Peter J. Holzer via Python-list wrote:

On 2024-10-11 17:13:07 -0400, AVI GROSS via Python-list wrote:

Is there some utility function out there that can be called to show what the
regular expression you typed in will look like by the time it is ready to be
used?

I assume that by "ready to be used" you mean the compiled form?

No, there doesn't seem to be a way to dump that. You can

p = re.compile("\\\\sout{")
print(p.pattern)

but that just prints the input string, which you could do without
compiling it first.

It prints the escaped version,

Did you mean the *un*escaped version? Well, yeah, that's what print
does.

so you can see if you escaped the string as you intended. In this
case, the print will display '\\sout{'.

print("\\\\sout{")
will do the same.

It seems to me that for any string s which is a valid regular expression
(i.e. re.compile doesn't throw an exception)

assert re.compile(s).pattern == s

holds.

So it doesn't give you anything you didn't already know.

As a trivial example, the regular expressions r"\\sout{" and r"\\sout\{"
are equivalent (the \ before the { is redundant). Yet
re.compile(s).pattern preserves the difference between the two strings.
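Both claims are easy to check for the two patterns in question. A minimal sketch with illustrative variable names; it tests these two patterns only and proves nothing about regular expressions in general:

```python
import re

patterns = [r"\\sout{", r"\\sout\{"]

# .pattern hands back exactly the string that was compiled.
for s in patterns:
    assert re.compile(s).pattern == s

# The two strings differ, yet both match the same 6-character text.
assert patterns[0] != patterns[1]
target = r"\sout{"
matches = [re.fullmatch(s, target) is not None for s in patterns]
print(matches)
```

So .pattern keeps the spelling, while the matching behaviour of the two spellings is the same for this text.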

hp

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"


From jak@21:1/5 to All on Sat Oct 19 00:15:23 2024

Peter J. Holzer ha scritto:

As a trivial example, the regular expressions r"\\sout{" and r"\\sout\{"
are equivalent (the \ before the { is redundant). Yet
re.compile(s).pattern preserves the difference between the two strings.

Hi,
Allow me to be fussy: r"\\sout{" and r"\\sout\{" are similar but not equivalent. If you omit the backslash, the parser will have to determine
whether the brace starts a repetition of the form {n, m}, and that takes more
time. Some online regex testers give these results:

r"\\sout{" : 1 match ( 7 steps, 620 μs )

r"\\sout\{" : 1 match ( 7 steps, 360 μs )
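Anyone who wants to measure this with Python's own re module, rather than an online tester, could try something like the sketch below. The helper compile_fresh is hypothetical and only clears the module's pattern cache so that each call really parses the pattern. The numbers depend on machine and Python version and are not expected to reproduce the figures above:

```python
import re
import timeit

def compile_fresh(pattern):
    """Compile a pattern after clearing re's cache, so parsing is repeated."""
    re.purge()
    return re.compile(pattern)

timings = {}
for pattern in (r"\\sout{", r"\\sout\{"):
    # Time 1000 uncached compilations of each spelling.
    timings[pattern] = timeit.timeit(lambda: compile_fresh(pattern), number=1000)

for pattern, seconds in timings.items():
    print(f"{pattern!r}: {seconds:.4f} s for 1000 compilations")
```

Any difference found this way concerns parsing time only; what the two patterns match is unaffected.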

