Forum: Too Lazy BBS

Re: irregular expressions, syntax complexity

From arnold@arnold@freefriends.org (Aharon Robbins) to comp.compilers on Wed Feb 22 08:53:05 2023

From Newsgroup: comp.compilers

In article <23-02-055@comp.compilers>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:

Regular expression syntax is missing an operator that signifies the >intersection of the sets recognized by the operand regexps. Let's
call this operator "&". ....

Has anybody implemented such an operator at all?

Doug McIlroy did, if I understand things correctly. See https://github.com/arnoldrobbins/mcilroy-regex/blob/master/README
for more info.

Arnold
--- Synchronet 3.21b-Linux NewsLink 1.2

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.compilers on Wed Feb 22 10:55:21 2023

From Newsgroup: comp.compilers

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes: >anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

At most one "." in front of at most one "E": [0-9]*[.]?[0-9]*E?[0-9]*
At least one of "." or "E": .*[.E].*
At least one digit in front of and after "E": .*[0-9].*E[0-9]+

There's still a bug here, because that makes "E" non-optional. Let's
make it optional:

At least one digit, when using E one before and one after: .*[0-9].*(E[0-9]+)?

In total:

[0-9]*[.]?[0-9]*E?[0-9]*&.*[.E].*&.*[0-9].*(E[0-9]+)?

- anton
--
M. Anton Ertl
anton@mips.complang.tuwien.ac.at
http://www.complang.tuwien.ac.at/anton/
[This doesn't seem all that much simpler than the usual approach with alternation, give or take bugs:
([0-9]+\.[0-9]*|\.[0-9]+)(E[0-9]+)?|[0-9]+E[0-9]+
-John]
--- Synchronet 3.21b-Linux NewsLink 1.2

From Kaz Kylheku@864-117-4973@kylheku.com to comp.compilers on Thu Feb 23 00:34:29 2023

From Newsgroup: comp.compilers

On 2023-02-21, Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:

You can translate regexps with & to a DFA, and don't lose any
performance there. For an NFA-based implementation, the only way I
see is that you process both sub-NFAs and check if they both accept
the string, so it's somewhat slower; I guess that this is a major
reason why such an operator has not been added to widely-used regexp syntaxes.

Has anybody implemented such an operator at all?

I implemented & in the TXR language.

It's based on Brzozowski regex derivatives (described first in 1964,
IIRC).

Derivatives allow a regex syntax with & to be interpreted directly,
and without backtracking or trying sub-cases; effectively, the approach
is equivalent to an NFA.

(Derivatives can be used for compiling also. The problem you have to
solve is deciding when two regexes are equivalent, so that you can
close loops in the state machine graph. The better you can do this, the
tighter the graph will be,)

The idea behind a "derivative" is to take an input symbol S and a regex
R, and calculate a new expression R' which matches the rest of the input
after S.

E .g. the derivativre of "abc" with respect to "a" would be "bc".
Likewise with respect to "." (match any single character).

The derivative of "abc" with respect to d would be rea, the nomatch regex
which describes/matches nothing: empty set of strings. (Not to be
confused with the empty regex that matches the empty string, denoting
the set { "" }).

To match a string of symbols, you simply take successive derivatives. If
you hit rea, you can short-circuit to the conclusion that the input
doesn't match.

When you run out of input symbols, the remaining derivative must be
nullable: nullable means capable of matching the empty string. If it
isn't, it means the regex wants more characters, so you don't have
a match.

The empty set rea regex isn't nullable, needless to say; a regex is
nullable if the language (set of strings) contains the empty string.

Anyway, you can represent regexes in Lisp syntax and then the derivative operation is just a functional syntax -> syntax calculation (which
generates a lot of garbage as we match strings).

The & operator is easily supported because the derivative operation
readily distributes into the & operatror according to an easy rule.

d(S, R1 & R2) -> d(S, R1) & d(S, R2)

Basically you take the derivatives of the branches, and combine them
with & again.

The more familiar | operator distributes the same way.

IN an actual implementation, there are tricks you can pull, like
short-circuit evaluation. If you take the d(S, R1) derivative of the
left side of R1&R2, and that turns out to be rea, you can short-circuit
to rea, If you have way to detect that a regular expression A matches all strings (like .*) then you know that A&R is equiavlent to R; you can
drop that branch, similarly A|R is A.

You can see that this is amenable to compilation, similar to the subset construction; you can calculate a derivative with respect to all
possible input symbols and build a graph in which the regex syntax
itself (all the derivatives) are the states. To close loops in the
graph, you need to detect states you have seen before, which requires equivalence test between two syntax fragments. The fewer false negatives
you have, the tighter the graph.

TXR's regex engine has a front end which checks whether the syntax
contains the "exotic operators" and chooses either the derivative-based
back end or compilation to NFA and simulation.

I haven't bothered implementing better compilation for either back end;
if a collaborator came along to explore those possibilities, that would
be great. Most of that work was done over a decade ago. There is now a
powerful Lisp dialect that would make easy work of writing a
derivative-based regex compiler.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca
--- Synchronet 3.21b-Linux NewsLink 1.2

Who's Online
Recent Visitors
- Geek2
  Sun May 17 07:06:15 2026
  from Euclid, Oh via Telnet
- Geek2
  Sat May 16 21:25:04 2026
  from Euclid, Oh via Telnet
- Jas Hud
  Sat May 16 00:50:28 2026
  from Bbs.Eob-Bbs.Com,wi via Telnet
- Geek2
  Fri May 15 19:53:20 2026
  from Euclid, Oh via Telnet

System Info

Sysop:	Amessyroom
Location:	Fayetteville, NC
Users:	65
Nodes:	6 (0 / 6)
Uptime:	08:02:47
Calls:	862
Files:	1,311
D/L today:	1 files (1,366K bytes)
Messages:	264,936

Re: irregular expressions, syntax complexity

Who's Online

Recent Visitors

System Info